and the Logical Consistency of Texts
Abstract: The article presents a new interpretation for Zipf's
law in natural language which relies on two areas of information
theory. Firstly, we reformulate the problem of grammar-based
compression and, secondly, we investigate properties of strongly
nonergodic stationary processes. The motivation for the joint
discussion is to prove a proposition with a simple informal
statement: If an $n$-letter long text describes $n^\beta$ independent
elementary facts in a random but repetitive way then the text
contains at least $n^\beta/\log n$ different words.

In the formal statement, two specific postulates are adopted.
Firstly, the words are understood as the nonterminal symbols of 
the shortest grammar-based encoding of the text. Secondly, the 
texts are assumed to be emitted by a strongly nonergodic source, 
where the described elementary facts are binary IID variables 
asymptotically predictable in a shift-invariant way. 

The proof of the formal proposition applies several new tools. 
These are: a construction of universal grammar-based codes for 
which the differences of code lengths can be bounded easily, 
ergodic decomposition theorems for mutual information between 
the past and future of a stationary process, and a lemma that 
bounds differences of a sublinear function. 

Linguistic relevance of the presented modeling assumptions, 
theorems, definitions, and examples is discussed in parallel. While 
searching for concrete processes to which our proposition can be 
applied, we introduce several instances of strongly nonergodic 
processes. In particular, we define the subclass of accessible 
description processes, which formalizes the notion of texts that 
describe facts in a self-contained way. 

Index Terms — Zipf's law, universal source coding, grammar- 
based codes, smallest grammar problem, ergodic decomposition, 
excess entropy, nonergodic processes, language models, sublinear 
functions, variable-length coding 
I. The problem statement

"If a Martian scientist sitting before his radio in 
Mars accidentally received from Earth the broadcast 
of an extensive speech [...], what criteria would he 
have to determine whether the reception represented 
the effect of animate process on Earth, or merely the 
latest thunderstorm on Earth? It seems that the only 
criteria would be the arrangement of occurrences of 
the elements, and the only clue to the animate origin 
would be this: the arrangement of the occurrences 
would be neither of rigidly fixed regularity such as 
frequently found in wave emissions of purely physi- 
cal origin nor yet a completely random scattering of 
the same." 

G. K. Zipf [4, page 187] 

The aim of this paper is to present a new explanation 
for the empirical distribution of words in natural language. 
To achieve this goal, we shall reformulate the problem of 
grammar-based compression [5], [6] and we will research 
information-theoretic properties of a subclass of strongly non- 
ergodic stationary processes. Thus both linguists and informa- 
tion theorists may find this paper interesting. 

From the empirical point of view, the distribution of words 
is quite well described by the celebrated Zipf-Mandelbrot law 
[4], [7], which states that the word frequency in a text is an
inverse power of the word rank. Some effort in probability 
theory has been devoted to inferring this law for various 
idealized settings. The most famous one is the monkey-typing 
model. In this model, the consecutive characters of the text are 
modeled as IID variables assuming values of both letters and 
spaces whereas the Zipf-Mandelbrot law is obeyed by strings 
of letters delimited by spaces [7], [8], [9]. The Zipf-Mandelbrot
law has also been derived as a result of multiplicative
processes [10], [11] or games [12].

The probabilistic explanations found so far may be considered
unsatisfactory from the linguistic point of view. The main
source of dissatisfaction is the intuition that hardly anything
is purely random or purely regular in human language,
cf. also [13]. The explanation proposed in this paper, based on
previous partial insights [14], [15], [1], [3], addresses some of
these concerns. To the best of our knowledge, two modelling
challenges will be taken into account for the first time:

(i) Words in the linguistic sense are some very nonarbitrary 
constituents of texts since they can be delimited in the 
text even when the spaces are absent. 

(ii) Texts in the linguistic sense refer to many facts unknown 
a priori to the reader but they usually do this in a con- 
sistent and repetitive way. 

Rather than the original Zipf-Mandelbrot law, we shall 
consider its integrated version, usually called Herdan's or 
Heaps' law in the English literature. This law says that the 
number of distinct words observed in a text is proportional to 
a power of the text length [16], [17], [18], [19]. The claim can 
be inferred from the original Zipf-Mandelbrot law assuming 
certain regularity of text growth [20], [21]. 

Thus the interest of this paper will be focused on proving 
a proposition which can be simply expressed in the following 
very informal way, assuming thereafter $\beta \in (0, 1)$:
(I) If an $n$-letter long text describes $n^\beta$ independent facts in
a consistent way then the text contains at least $n^\beta/\log n$
different words.
Thesis (I) resonates with the ideas of semantic information 
developed many years ago by Bar-Hillel and Carnap [22] but 
it will be formalized and proved here using concepts of the 
modern Shannon information theory, including newly derived 
results. To argue that texts in natural language can describe 
so many independent facts, we will present simple stochastic 
processes with appealing linguistic interpretations. 

So as to translate thesis (I) into a provable statement, we will 
assume several specific modeling postulates, the plausibility of 
which is discussed below: 

The definition of words in texts: Firstly, the set of words 
contained in a text will be understood as the set of letter 
strings that are repeated within the text significantly many 
times. Rough empirical correspondence between such letter 
chunks and words in the linguistic sense has been observed 
for texts in some natural languages [23], [24], [25], [26]. 
Developing the ideas of the cited authors, the letter chunks will 
be understood specifically as the distinct nonterminal symbols 
in the shortest grammar-based encoding of the text. On the 
other hand, simple string repeats can be shown to abound in 



the outputs of memoryless sources, cf. [27], [15], and they do 
not appear so meaningful for linguists. 

Grammar-based codes [5], [6] are uniquely decodable codes 
which compress strings by transforming them first into special 
context-free grammars and then encoding the grammars as less 
redundant strings. An example of such a grammar is 

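$$
\left\{
\begin{array}{l}
A_1 \to A_2 A_2 A_4 A_5 \texttt{dear\_children} A_5 A_3 \texttt{all.} \\
A_2 \to A_3 \texttt{you} A_5 \\
A_3 \to A_4 \texttt{\_to\_} \\
A_4 \to \texttt{Good\_morning} \\
A_5 \to \texttt{,\_}
\end{array}
\right\}. \qquad (1)
$$

If we start the derivation with symbol $A_1$ and follow the
rewriting rules, we obtain a predecessor of the song Happy
Birthday to You, the latter debatedly copyrighted.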
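
The derivation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `RULES` and the helper `expand` are hypothetical names, the underscore stands for a space, and the rule for A5 is assumed to be a comma followed by a separator.

```python
# Rules of the example grammar (1); "_" stands for a space.
# Assumption: A5 rewrites to a comma followed by a separator.
RULES = {
    "A1": ["A2", "A2", "A4", "A5", "dear_children", "A5", "A3", "all."],
    "A2": ["A3", "you", "A5"],
    "A3": ["A4", "_to_"],
    "A4": ["Good_morning"],
    "A5": [",_"],
}


def expand(symbol, rules):
    """Rewrite a symbol recursively until only terminal strings remain."""
    if symbol not in rules:  # a chunk of terminal symbols: emit it as is
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


text = expand("A1", RULES)
vocabulary_size = len(RULES)  # number of distinct nonterminal symbols

print(text.replace("_", " "))
print(vocabulary_size)
```

Starting from A1 yields the whole text, and the number of keys in `RULES` is the number of distinct nonterminals of the grammar.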

In the compressions of longer texts, nonterminals $A_i$ often
correspond to words or set phrases in the linguistic sense (like
New York), especially if it is additionally required that the
nonterminals be defined as strings of only terminal symbols
[26].

Thus the number of distinct nonterminal symbols in
a grammar-based compression, which equals 5 for example
(1), will be henceforth called the vocabulary size of the
grammar. A lower bound for the vocabulary size of some specific
grammar will be given in terms of the number of independent
facts described by the compressed text. The suitable grammar
minimizes a certain natural grammar length function, which has
not been considered in the information-theoretic literature [5],
[6] but is close to the one used in the computational linguistic
experiments [24], [26].¹

The definition of facts described by texts: Secondly,
we have to make precise the notion of a corpus (a
collection) of texts that describe random facts in a repetitive
way. Both the corpus of texts and the state of affairs repeatedly
described in the corpus will be modeled as random variables.

Let $Z_k$, $k = 1, 2, 3, \ldots$, be the logical values (true or
false), with respect to the random state of affairs, of certain
systematically enumerated logically independent propositions.
We assume that the $Z_k$'s, when interpreted as random variables,
are equidistributed and probabilistically independent. Such
variables exist if the space of possible states of affairs is
sufficiently complex, namely, if the possible states of affairs
generate a nonatomic $\sigma$-field [3]. The $Z_k$'s will be called
(elementary) facts. On the other hand, let $X_i$, $i = 1, 2, 3, \ldots$,
be consecutive text units of a fixed level, such as letters, words,
or sentences. We suppose that each fact $Z_k$ can be ultimately
inferred from the corpus if we start reading it from an arbitrary
position.

¹Notwithstanding the adopted definition of the vocabulary size, we are
aware that the proportionality between the number of distinct nonterminals
in the smallest grammar and the number of different words in the linguistic
sense can be valid only approximately. Let us notice that hapaxes, i.e., words
that appear just once in the text, cannot be recognized as nonterminals by
a good grammar-based compressor. By Zipf's law for middle-sized texts,
roughly every second distinct word is a hapax. But then the number of distinct
nonterminals and that of distinct words can be proportional.

The situation gets more complicated for very short and very long texts,
where the proportion of hapaxes varies [28], [29]. With the text length,
this proportion decreases and there appear many repeatedly used multiword
expressions, a.k.a. set phrases, recognized as convenient nonterminals by
a good compressor [24]. Moreover, the vocabulary growth with the text size
depends sharply on whether the text was written by a single author or whether
it is a multiauthor collection. The proportion of hapaxes ultimately decreases
exponentially in the first case [29], whereas it seems to stay bounded away
from zero in the second one [30], [29].

Formally, let $(X_i)_{i\in\mathbb{Z}}$ be a stochastic process on a
probability space $(\Omega, \mathcal{J}, P)$, where variables
$X_i : \Omega \to \mathbb{X}$ assume values from a fixed countable set
$\mathbb{X}$ (called the alphabet). Notation
$X_{m:n} := (X_k)_{m \le k \le n}$ will be used for strings of the
variables (also called blocks). The following definition was
introduced in [3] and captures what we need:

Definition 1.1: A stochastic process $(X_i)_{i\in\mathbb{Z}}$ is called
strongly nonergodic² if there exists a binary process
$(Z_k)_{k\in\mathbb{N}} \sim$ IID (i.e., independent identically
distributed variables) with $P(Z_k = 0) = P(Z_k = 1) = 1/2$ and there
exist functions $s_k : \mathbb{X}^* \to \{0, 1\}$, $k = 1, 2, 3, \ldots$,
such that

$$\lim_{n\to\infty} P\left(s_k(X_{t+1:t+n}) = Z_k\right) = 1 \qquad (2)$$

for all $t \in \mathbb{Z}$.

It has been often supposed that the generation of texts 
should be modeled by a nonergodic process [31, Section 6.4]. 
In fact, a stationary process is strongly nonergodic if and only 
if it has a nonatomic shift-invariant sub-$\sigma$-field [3]. Moreover,
a strongly nonergodic process cannot be an IID process or 
a finite state hidden Markov process, the kinds of processes 
considered in the monkey typing explanations. 

The number of facts described by the text $X_{1:n}$ will be
identified with the number of $Z_k$'s that may be predicted with
probability at least $\delta$ given $X_{1:n}$. That is, this number
will be understood as the cardinality of set

$$U_\delta(n) := \left\{k \in \mathbb{N} : P\left(s_k(X_{1:n}) = Z_k\right) \ge \delta\right\}, \qquad (3)$$

where $\delta > 1/2$.

To illustrate how the abstract concept of a strongly nonergodic
process matches some preconceptions about human
language communication, let us consider the following example.
Let the alphabet be $\mathbb{X} = \mathbb{N} \times \{0, 1\}$ and let the
process $(X_i)_{i\in\mathbb{Z}}$ have the form

$$X_i := (K_i, Z_{K_i}), \qquad (4)$$

where $(Z_k)_{k\in\mathbb{N}}$ and $(K_i)_{i\in\mathbb{Z}}$ are
probabilistically independent whereas $(K_i)_{i\in\mathbb{Z}}$ is such an
ergodic stationary process that $P(K_i = k) > 0$ for every natural
number $k \in \mathbb{N}$. Under these assumptions it will be
demonstrated that variables (4) form a strongly nonergodic process.
In particular, the cardinality of set $U_\delta(n)$ is of order
$n^\beta$ if we assume $(K_i)_{i\in\mathbb{Z}} \sim$ IID with
$P(K_i = k) \propto k^{-1/\beta}$.
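
Example (4) can be simulated directly. The sketch below is illustrative only: the function names, the truncation of the address distribution to `1..k_max`, and the seed are assumptions, and the count of distinct addresses seen in a prefix is used as a rough stand-in for the cardinality of $U_\delta(n)$, not as its definition.

```python
import random


def simulate(n, beta=0.5, k_max=10**5, seed=0):
    """Draw X_1..X_n with X_i = (K_i, Z_{K_i}) as in example (4).

    Assumptions of this sketch: K_i are IID with P(K_i = k) proportional
    to k**(-1/beta), truncated to 1..k_max so that sampling is finite.
    """
    rng = random.Random(seed)
    ks = range(1, k_max + 1)
    weights = [k ** (-1.0 / beta) for k in ks]
    addresses = rng.choices(ks, weights=weights, k=n)
    facts = {}  # fair bits Z_k, drawn lazily when first addressed
    text = []
    for k in addresses:
        if k not in facts:
            facts[k] = rng.randint(0, 1)
        text.append((k, facts[k]))
    return text


def described_facts(text):
    """Number of distinct bit addresses k mentioned in the text."""
    return len({k for k, _ in text})


text = simulate(10000)
for n in (100, 1000, 10000):
    print(n, described_facts(text[:n]))
```

Because each bit $Z_k$ is drawn once and reused, two statements with the same address always carry the same value, whatever prefix of the sequence is read.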

The variables $X_i = (K_i, Z_{K_i})$ can be given some formal
semantic interpretation. Imagine that $(X_i)_{i\in\mathbb{Z}}$ is a
sequence of consecutive statements extracted from a random collection
of texts which describe some random state of affairs
$(Z_k)_{k\in\mathbb{N}}$ consistently. Each statement of form
$X_i = (k, z)$ asserts that the value of a random $k$-th bit of the
state of affairs is $z$, i.e., it affirms that $Z_k = z$ in such a way
that both the bit address $k$ and its value $z$ can be identified.
Logical consistency of the description is reflected in the following
property: If two statements $X_i = (k, z)$ and $X_j = (k', z')$ happen
to describe bits of the same address ($k = k'$) then they always assert
the same bit value ($z = z'$).³

²The not so fortunate name uncountable description process was used
originally in [3].

Other modeling assumptions: Although example (4)
clearly illustrates the linguistic relevance of certain strongly
nonergodic processes, the stochastic processes for which
proposition (I) will be established rigorously do not have
the specific form (4). For technical reasons, the alphabet
$\mathbb{X}$ will be assumed finite. Moreover, we shall assume that the
probabilistic source which generates the texts is a stationary
finite-energy process. Finite-energy processes are processes
with exponentially damped conditional block probabilities
[36]. Such a condition is satisfied for processes dithered with
an IID noise [36], so it seems reasonable in the context
of natural language modeling. Assuming stationarity and a
finite alphabet for natural language models also has a long
tradition in information theory [37].

The plain-word statement of thesis (I) conceals its linkage
with information theory and its historical origin. For the
stationary process $(X_i)_{i\in\mathbb{Z}}$ of discrete variables $X_i$,
let us define the $n$-symbol block entropy
$H(n) := H(X_{t+1:t+n}) = -\mathbf{E} \log P(X_{t+1:t+n})$,
$\mathbf{E}$ being the expectation operator. Then
denote the block mutual information as

$$E(n) := 2H(n) - H(2n) = I(X_{1:n}; X_{n+1:2n}), \qquad (5)$$

called the $n$-symbol excess entropy after [38].
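
As a small numerical illustration of definition (5), the following sketch computes $H(n)$ and $E(n)$ exactly for an equal-weight mixture of Bernoulli sources. The mixture, the parameter values, and the function names are assumptions chosen for illustration; a single-component mixture is an IID source, while a two-component mixture is a simple nonergodic source carrying one hidden binary fact. Entropies are in nats.

```python
from math import comb, log


def block_entropy(n, params):
    """H(n) in nats for an equal-weight mixture of Bernoulli(p) sources."""
    h = 0.0
    for ones in range(n + 1):
        # probability of one particular block with the given number of ones
        p_block = sum(p ** ones * (1 - p) ** (n - ones) for p in params)
        p_block /= len(params)
        if p_block > 0:
            # comb(n, ones) blocks share this probability
            h -= comb(n, ones) * p_block * log(p_block)
    return h


def excess_entropy(n, params):
    """E(n) = 2 H(n) - H(2n), cf. (5)."""
    return 2 * block_entropy(n, params) - block_entropy(2 * n, params)


for n in (1, 5, 20):
    print(n, excess_entropy(n, [0.3]), excess_entropy(n, [0.2, 0.8]))
```

For the IID source the excess entropy vanishes up to rounding, whereas for the mixture it grows with $n$ and stays below $\log 2$, the entropy of the single hidden bit.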

The supposition that $E(n) \propto n^\beta$ for natural language, with
$\beta \approx 1/2$, was raised by Hilberg [39], who interpreted in this
way the graph in Shannon's seminal paper [37], cf. also [40],
[41], [42], [43], [44], [38]. Hence we refer to this proposition
as Hilberg's thesis (or Hilberg's law). Hilberg's thesis provided
a direct inspiration for our research of proposition (I), since
this can be split into two more specific assertions:

(Ia) Consider a stationary strongly nonergodic process
$(X_i)_{i\in\mathbb{Z}}$ over a finite alphabet. If the cardinality of the
set $U_\delta(n)$, defined in (3), is greater than $c_1 n^\beta$ then
$E(n)$ is not less than $c_2 n^\beta$, for some positive $c_1$ and $c_2$.
(Ib) Consider a finite-energy stationary process
$(X_i)_{i\in\mathbb{Z}}$ over a finite alphabet. If $E(n)$ is greater than
$c_2 n^\beta$ then the shortest grammar-based compression of the block
$X_{1:n}$ applies at least $c_3 n^\beta/\log n$ distinct nonterminal
symbols on average, for some positive $c_2$ and $c_3$.
The exact statements are to be understood in an asymptotic
sense, explained in the following section. A heuristic proof
of proposition (Ib) was sketched in [14]. This paper furnishes
the formal proof and develops a discussion of the logically
earlier proposition (Ia), supplemented by the construction of
some suitable stochastic processes.

³Although the concept of a strongly nonergodic process is more general
than example (4), it formalizes an optimistic vision of human communication.
For an infinite collection of texts $(X_i)_{i\in\mathbb{Z}}$ there is an
infinite collection of independent elementary facts $(Z_k)_{k\in\mathbb{N}}$
which are unknown to the text reader but are referred to in the texts. There
is a fixed method of interpreting finite texts to infer these facts, namely
functions $s_k$ that represent human language competence. They allow readers
to determine any fact with a growing certainty the more texts they read,
regardless of their starting point.

We may say that the assumed shift-invariance of successful prediction (2)
reflects the intuition that human language competence does not change over
generations of readers. Exposed to exactly the same collection of texts from
their birth, two ideal readers would understand them in the same way (i.e.,
they would predict the same values of $Z_k$).

Our modeling cries, of course, for a concrete semantic interpretation of the
elementary facts $(Z_k)_{k\in\mathbb{N}}$. Some people might think of the
halting probability $\Omega$, an incompressible infinite binary sequence which
formally represents a certain amount of timeless independent truths pursued by
mathematicians [32, Section 4], [33], [34, Section 3.6.2]. To an uninformed
reader, the binary expansion of $\Omega$ may look like a typical
probabilistically random sequence. We doubt, however, that the bits of
$\Omega$ could be guessed at the required rate by a human being who has no
access to a supernatural power.

The facts that are repetitively described in the everyday language usage
seem to be of more accidental nature and easier to infer. Moreover, since
the existing world is unique and probabilities are mostly theoretical concepts,
it is advisable to regard the main result of this paper as a weaker version
of some yet unknown statement in the algorithmic information theory. That
hypothetical proposition could deal with individual real texts and different
particular worlds, also fictitious ones, recurrently described in them. That
hypothetical proposition may be also related to the problem of extracting
common algorithmic information [35].

Although the tools used to demonstrate propositions (Ia) and
(Ib) are different, it is reasonable to consider these propositions
jointly as a means to formalize and prove thesis (I). The
reasons are as follows. Hilberg's thesis was formulated merely
on the basis of the block entropy estimates for printed English
published by Shannon [37]. It can be argued that Shannon's
estimates are too crude to infer the asymptotic behavior of
excess entropies $E(n)$, cf. [45]. Proposition (Ia) makes Hilberg's
thesis more likely, regardless of the estimation difficulties.
Conversely, (Ib) adds some empirical aspect to the rationalist
statement (Ia). Discussing propositions (Ia) and (Ib) together
may also provide more insight into stronger formalizations of
thesis (I) than the main theorem of this manuscript.

The remainder of this paper is split into several semi- 
independent parts. An overview of the composition is given 
in the next section. Since the manuscript is multidisciplinary, 
we tried to keep it self-contained. 

II. An overview of the tools and results 

The central result of this article is Theorem 5.1 in Section
V, which formalizes thesis (I). The exact phrasing of the
theorem is not reproduced in advance since it depends on
a rather long construction in Section III, which covers a new
class of grammar-based codes. Contrary to a typical scheme
of presentation, it is easier to give a heuristic sketch of the
proof first, which will be done right now.

Let $H(n) := H(X_{t+1:t+n}) = -\mathbf{E} \log P(X_{t+1:t+n})$ be the
$n$-symbol block entropy of a stationary process $(X_i)_{i\in\mathbb{Z}}$,
where variables $X_i : \Omega \to \mathbb{X}$ assume values from the
countable set $\mathbb{X}$. An important parameter of the process is its
entropy rate

$$h := \inf_{n\in\mathbb{N}} H(n)/n = \lim_{n\to\infty} H(n)/n. \qquad (6)$$

Consider also the set of well-predictable facts $U_\delta(n)$, defined
in equation (3). Let us define the block "pseudoentropy"

$$H^U(n) := hn + [\log 2 - \eta(\delta)] \cdot \operatorname{card} U_\delta(n), \qquad (7)$$

where

$$\eta(p) := -p \log p - (1 - p) \log(1 - p) \qquad (8)$$

is the entropy of binary distribution $(p, 1 - p)$ and
$\operatorname{card} U_\delta(n)$ denotes the cardinality of set
$U_\delta(n)$.



Using some facts about the ergodic decomposition, to be
derived in Subsections IV-B and IV-C, we can prove that
$H(n) \ge H^U(n)$ and $\lim_n H^U(n)/n = h$ for a finite alphabet
$\mathbb{X}$. Thus the excess-bounding lemma from Appendix I can
be applied to function $G(n) = H(n) - H^U(n)$. In particular,
we obtain

$$\liminf_{n\to\infty} \frac{\operatorname{card} U_\delta(n)}{n^\beta} > 0 \implies \limsup_{n\to\infty} \frac{E(n)}{n^\beta} > 0 \qquad (9)$$

for the $n$-symbol excess entropy $E(n) = 2H(n) - H(2n)$, as
an instance of implication (102).

Implication (9) formalizes proposition (Ia). The premise is
true in particular for the strongly nonergodic process (4) with
$(K_i)_{i\in\mathbb{Z}} \sim$ IID and the marginal distribution
$P(K_i = k) \propto k^{-1/\beta}$ (cf. Subsection VI-B). Although this
process is over an infinite alphabet, the right-hand side of (9) holds
as well (cf. Subsection VI-C).

In the following, let us consider proposition (Ib). Before
focusing on grammar-based codes, we discuss a less specific
case. Denote the set of nonempty strings as
$\mathbb{X}^+ := \bigcup_{n\in\mathbb{N}} \mathbb{X}^n$
and the set of all strings as $\mathbb{X}^* := \mathbb{X}^+ \cup \{\lambda\}$,
where $\lambda$ is the empty string. Let $C : \mathbb{X}^+ \to \mathbb{Y}^+$
be a uniquely decodable code over an input alphabet $\mathbb{X}$ and a
finite output alphabet $\mathbb{Y} = \{0, 1, \ldots, D_Y - 1\}$, i.e., its
extension $C^* : (u_1, \ldots, u_k) \mapsto C(u_1) \ldots C(u_k)$ to finite
tuples of strings $u_i \in \mathbb{X}^+$ is an injection. Denote the
expected length of code $C$ as

$$H^C(n) := \mathbf{E}\,|C(X_{1:n})| \log D_Y. \qquad (10)$$

For a uniquely decodable code, the coding inequality
$H^C(n) \ge H(n)$ is satisfied [31] and thus the code will be
called universal if its limiting compression rate $\lim_n H^C(n)/n$
equals the entropy rate, i.e., $\lim_n H^C(n)/n = h$ for any
stationary process. There are no universal codes for an infinite
input alphabet $\mathbb{X}$ [46], [47] but they exist for a finite
$\mathbb{X}$ [48], [49], [5].

Let us observe that if the code $C$ is universal then there
holds an equality of rates

$$\lim_{n\to\infty} H^C(n)/n = \lim_{n\to\infty} H(n)/n = \lim_{n\to\infty} H^U(n)/n \qquad (11)$$

and a transitive inequality

$$H^C(n) \ge H(n) \ge H^U(n). \qquad (12)$$

Hence, as an instance of relations (101) and (102), implications

$$\liminf_{n\to\infty} \frac{E(n)}{n^\beta} > 0 \implies \limsup_{n\to\infty} \frac{E^C(n)}{n^\beta} > 0, \qquad (13)$$

$$\liminf_{n\to\infty} \frac{\operatorname{card} U_\delta(n)}{n^\beta} > 0 \implies \limsup_{n\to\infty} \frac{E^C(n)}{n^\beta} > 0 \qquad (14)$$

hold for the expected excess length of the code

$$E^C(n) := 2H^C(n) - H^C(2n) = \mathbf{E}\left[|C(X_{1:n})| + |C(X_{n+1:2n})| - |C(X_{1:2n})|\right] \log D_Y. \qquad (15)$$

The implication converse to (13) is not true, which follows
from the negative result of [50], see Appendix III.

Whereas relation (9) rephrases thesis (Ia), implications (13)
and (14) correspond in part to theses (Ib) and (I) respectively.
The missing part of the correspondence is that (13) and (14)
do not contain the specific bound for the vocabulary size of
a grammar-based code.

By the second line of formula (15), the suitable completion
of (13) and (14) can be given by a lower bound for the
vocabulary size of the shortest grammar-based code $C$ in terms
of the code's excess length

$$|C(u)| + |C(v)| - |C(uv)| \qquad (16)$$

provided code $C$ is universal. If one considers the code
length $|C(u)|$ as an analogue of the algorithmic complexity
of string $u$, cf. [6], then (16) is the analogue of algorithmic
mutual information [51]. The technical details of bounding
the vocabulary size in terms of the excess code length (16)
are easily motivated by the following heuristic reasoning.

Grammar-based codes compress strings by transforming
them first into special grammars, called admissible grammars
[5], and then encoding the grammars back into strings
according to a fixed simple schema. An admissible grammar
is a context-free grammar which generates some singleton
language $\{w\}$, $w \in \mathbb{X}^+$, and whose production rules do not
have empty right-hand sides [5]. In such a grammar, there is
one rule per nonterminal symbol and the nonterminals can be
ordered so that the symbols are rewritten onto strings of strictly
succeeding symbols [5]. Hence, an admissible grammar is
given by its set of production rules

$$G = \left\{ \begin{array}{l} A_1 \to \alpha_1, \\ A_2 \to \alpha_2, \\ \ldots, \\ A_n \to \alpha_n \end{array} \right\}, \qquad (17)$$

where $A_1$ is the start symbol, other $A_i$ are secondary
nonterminals, and the right-hand sides of rules satisfy
$\alpha_i \in (\{A_{i+1}, A_{i+2}, \ldots, A_n\} \cup \mathbb{X})^+$.
The vocabulary size of $G$, i.e.,
the number of used nonterminal symbols, will be written

$$\mathbf{V}[G] := \operatorname{card}\{A_1, A_2, \ldots, A_n\} = n.$$

On the other hand, the Yang-Kieffer length of grammar $G$ is

$$|G| := \sum_{i=1}^{n} |\alpha_i|, \qquad (18)$$

where $|\alpha|$ is the length of
$\alpha \in (\{A_1, A_2, \ldots, A_n\} \cup \mathbb{X})^*$ [5].
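
The quantities just defined are easy to compute once a representation is fixed. In the sketch below, the representation is an assumption made for illustration: a grammar is a list of right-hand sides, nonterminal $A_j$ is the integer `j`, and terminals are one-character strings. The check covers only the conditions stated in the text (nonempty right-hand sides and strictly succeeding nonterminals), not every requirement of [5].

```python
def is_admissible(grammar):
    """Check the conditions stated for (17): every right-hand side is
    nonempty and rule i refers only to nonterminals with index above i."""
    n = len(grammar)
    for i, rhs in enumerate(grammar, start=1):
        if not rhs:
            return False
        for sym in rhs:
            if isinstance(sym, int) and not (i < sym <= n):
                return False
    return True


def vocabulary_size(grammar):
    """V[G]: the number of nonterminal symbols (one rule per nonterminal)."""
    return len(grammar)


def yang_kieffer_length(grammar):
    """|G|: the sum of the lengths of all right-hand sides, cf. (18)."""
    return sum(len(rhs) for rhs in grammar)


def expand(grammar, i=1):
    """The string generated from nonterminal A_i."""
    return "".join(
        s if isinstance(s, str) else expand(grammar, s) for s in grammar[i - 1]
    )


# A1 -> A2 A2 c,  A2 -> A3 A3,  A3 -> a b
G = [[2, 2, "c"], [3, 3], ["a", "b"]]
print(is_admissible(G), expand(G), vocabulary_size(G), yang_kieffer_length(G))
```

Here a string of nine letters is represented by a grammar of Yang-Kieffer length 7 with three nonterminals, because the repeated chunks are factored out.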

If a string $w$ contains many repeated substrings then some
grammar for $w$ can "factor out" the repetitions and may be
used to represent $w$ concisely. The set of admissible grammars
will be denoted as $\mathcal{G}$ while $\mathcal{G}(w) \subset \mathcal{G}$
will stand for the subset of admissible grammars which generate the
language $\{w\}$, $w \in \mathbb{X}^+$. A function
$\Gamma : \mathbb{X}^+ \to \mathcal{G}$ such that
$\Gamma(w) \in \mathcal{G}(w)$ for all $w \in \mathbb{X}^+$ is called
a grammar transform [5].

We may suppose naively that the length of the shortest
grammar $|\Gamma(w)|$ for $w$ is a sufficiently good approximation
of the length of the shortest universal grammar-based code
$|C(w)|$, cf. [6], [52]. Thus, we could obtain an upper bound
for the excess code length (16), needed to establish (Ib), from
a similar bound for the excess grammar length

$$|\Gamma(u)| + |\Gamma(v)| - |\Gamma(uv)|.$$

Indeed, there is a simple bound for the latter quantity in terms
of the vocabulary size.



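Theorem 2.1: Let $\Gamma$ be a minimal grammar transform, i.e.,

$$|\Gamma(w)| = \min_{G \in \mathcal{G}(w)} |G| \qquad (19)$$

and let $\mathbf{L}(w)$ be the maximal length of a (possibly overlapping)
repeat in $w$, i.e.,

$$\mathbf{L}(w) := \max\{|s| : w = x_1 s y_1 = x_2 s y_2 \land x_1 \ne x_2\}, \qquad (20)$$

where $s, x_i, y_i \in \mathbb{X}^*$. For any strings $w = uv$,
$u, v \in \mathbb{X}^+$, we have

$$0 \le |\Gamma(u)| + |\Gamma(v)| - |\Gamma(w)| \le \mathbf{V}[\Gamma(w)]\,\mathbf{L}(w). \qquad (21)$$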

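Proof: This result was noticed in part in [14, Theorem
3]. A brief justification is as follows. For any string
$\alpha \in (\{A_2, A_3, \ldots, A_n\} \cup \mathbb{X})^*$, denote its
expansion with respect to (17) as $\langle\alpha\rangle_G$, i.e.,
$\{\langle\alpha\rangle_G\}$ is the language generated by
grammar (17) with $\alpha_1 = \alpha$ [6]. Let a minimal grammar for
$w = uv$ be of the form

$$G = \left\{ \begin{array}{l} A_1 \to x_L x_M x_R, \\ A_2 \to \alpha_2, \\ \ldots, \\ A_n \to \alpha_n \end{array} \right\}.$$

We will split it into two separate grammars for $u$ and $v$,

$$G_L = \left\{ \begin{array}{l} A_1 \to x_L y_L, \\ A_2 \to \alpha_2, \\ \ldots, \\ A_n \to \alpha_n \end{array} \right\}, \qquad G_R = \left\{ \begin{array}{l} A_1 \to y_R x_R, \\ A_2 \to \alpha_2, \\ \ldots, \\ A_n \to \alpha_n \end{array} \right\},$$

where the string $x_M$ of length $|x_M| \le 1$ at the boundary of
the descriptions for $u$ and $v$ gets expanded into a string of
terminal symbols $\langle x_M \rangle_G = y_L y_R \in \mathbb{X}^*$.
Since $|\alpha_i| \le \mathbf{L}(w)$ for $i \ge 2$
and $|y_L y_R| \le \mathbf{L}(w)$ by minimality of $G$, we obtain

$$|\Gamma(u)| + |\Gamma(v)| \le |G_L| + |G_R| \le |G| + \mathbf{V}[G]\,\mathbf{L}(w).$$

Regrouping the terms yields the right inequality in (21). The
proof of the left inequality applies grammar joining rather than
splitting and can be found in [14]. ∎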

Inequality (21) constitutes a nontrivial lower bound of the
vocabulary size only if the maximal repeat length $\mathbf{L}(w)$ can
be upper-bounded well enough. A logarithmic bound for the
latter is the best we may count on, $\mathbf{L}(w) = O(\log |w|)$,
and it actually holds for finite-energy processes almost surely
[36], i.e., $\mathbf{L}(X_{1:n}) = O(\log n)$ a.s., as well as in expectation.
These results and the definition of finite-energy processes are
detailed for reference in Appendix II.
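
For short strings, the maximal repeat length of definition (20) can be computed by brute force. The sketch below uses a hypothetical helper name and a deliberately naive search; it is meant only to make the definition concrete, including the fact that the two occurrences may overlap.

```python
def max_repeat_length(w):
    """L(w) from (20): the length of the longest substring that occurs
    at two different positions of w (occurrences may overlap)."""
    n = len(w)
    for length in range(n - 1, 0, -1):  # try the longest candidates first
        seen = set()
        for start in range(n - length + 1):
            s = w[start:start + length]
            if s in seen:
                return length
            seen.add(s)
    return 0  # only the empty string repeats


for w in ("abc", "abcabc", "aaaa"):
    print(w, max_repeat_length(w))
```

The search is cubic in the worst case, so a suffix-based structure would be preferable for long texts; the naive version suffices to check small examples by hand.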

Although the implications and inequalities derived above,
combined with the bound for the maximal repeat length from
Appendix II, provide a heuristic rationale
in favor of theses (Ia), (Ib), and (I), they do not constitute
a rigorous proof. The flaw is that the minimal Yang-Kieffer
length $|\Gamma(\cdot)|$ is too crude an approximation of a universal
code length. For any uniquely decodable code $C$, we have
$\lim_n \max_{w \in \mathbb{X}^n} |C(w)|/n \ge 1$ necessarily. On the
other hand, a grammar transform $\Gamma$ is called asymptotically
compact if

$$\lim_{n\to\infty} \max_{w \in \mathbb{X}^n} |\Gamma(w)|/n = 0 \qquad (22)$$

and for each grammar in $\Gamma(\mathbb{X}^+)$ each nonterminal has
a different expansion. In particular, any minimal grammar transform
(19) is asymptotically compact [5], [6].

In the following three sections we will construct a rigorous
formalization of thesis (I) and its formal proof:

(i) In Section III we will build a new class of universal
grammar-based codes over a finite alphabet. Rather than
applying the standard grammar-to-string encoder by Kieffer
and Yang [5], these codes use a novel local encoder
inspired by the simplistic code of Neuhoff and Shields
[49] (Subsections III-A through III-D). Thus the codes
satisfy an analogue of (21) with the excess code length
(16) substituted for the excess grammar length.

(ii) Section IV is a study of nonergodic stationary processes.
It provides the proofs of equality (11) and inequality (12)
for strongly nonergodic processes over a finite alphabet.
Some useful preliminary facts to be introduced include
elementary algebraic identities satisfied by excess entropy
and the ergodic decomposition of this quantity (cf. also [15]).

(iii) Section V puts the results together. Thesis (I) is expressed
as a formal statement, namely Theorem 5.1. Several ideas
for formulating propositions that would be stronger than
Theorem 5.1 are discussed immediately after its proof.

The issue of this paper is, in fact, to both formalize and
prove thesis (I). From this point of view, it is important to
construct examples of stochastic processes which satisfy the
assumption of Theorem 5.1 and to demonstrate that they are
relevant for natural language modelling. These questions will
be dealt with in Section VI.

The article is briefly concluded in Section VII. Four appendices
that follow provide supplementary material. The
excess-bounding lemma for sublinear nonnegative functions
is exposed in Appendix I. Appendix II presents bounds for
the length of the longest repeat. Two results concerning the
difference $E^C(n) - E(n)$ are derived in Appendix III. In
Appendix IV we discuss a peculiar behavior of the vocabulary
size for the Yang-Kieffer codes based on irreducible grammar
transforms. Namely, the vocabulary size of these codes is
a growing function of the entropy rate $h$ of the compressed
process. The vocabulary size of the new class of grammar-based
codes seems rather a growing function of the redundancy
$H(1) - h$.

III. Grammar-based codes 

For the set of admissible grammars $\mathcal{G}$, a grammar-based
code is a uniquely decodable code of form
$C = B(\Gamma(\cdot)) : \mathbb{X}^+ \to \mathbb{Y}^+$, where
$\Gamma : \mathbb{X}^+ \to \mathcal{G}$ is a (string-to-)grammar
transform and $B : \mathcal{G} \to \mathbb{Y}^+$ is called a
grammar(-to-string) encoder [5]. In principle, the grammar encoder
should be chosen as sufficiently good for many different grammar
transforms. To guarantee the existence of universal codes of form
$C = B(\Gamma(\cdot))$, we shall assume further in this section that both
input and output alphabets are finite, $\mathbb{X} = \{0, 1, \ldots, D_X - 1\}$
and $\mathbb{Y} = \{0, 1, \ldots, D_Y - 1\}$ in particular.

Indeed, there exists a grammar encoder
$B_{YK} : \mathcal{G} \to \mathbb{Y}^+$ [5],
called the Yang-Kieffer encoder, such that

(i) set $B_{YK}(\mathcal{G})$ is prefix-free,

(ii) $|B_{YK}(G)| \le |G| (A + \log_{D_Y} |G|)$ for some $A > 0$,

(iii) $C = B_{YK}(\Gamma(\cdot))$ is a universal code for any asymptotically
compact transform $\Gamma$.

Unfortunately, in the case of code $C = B_{YK}(\Gamma(\cdot))$, it is hard to
compare the excess grammar length $|\Gamma(u)| + |\Gamma(v)| - |\Gamma(uv)|$
with the excess code length $|C(u)| + |C(v)| - |C(uv)|$. Thus
we will consider another grammar encoder.

Let us notice that notation (17) can be reduced to

$$G = (\alpha_1, \alpha_2, \ldots, \alpha_n) \qquad (23)$$

without any confusion. Subsequently, we will write (23)
instead of (17). We will also define a grammar encoder that
represents $G$ as a string resembling list (23). This encoder
yields universal codes given a simple condition (Theorem
3.9) and provides nearly a homomorphism between some
operations on grammars and strings. Hence the universal codes
satisfy an analogue of Theorem 2.1 as well (Theorem 3.11).

A. Local grammar encoders 

The proof of inequality (21) sketched in Section II applies
certain "cut-and-paste" operations on grammars. Besides the
operations mentioned there, the following one was used in [14]
to prove that the left-hand side of (21) is nonnegative:

Definition 3.1: $\oplus : \mathcal{G} \times \mathcal{G} \to \mathcal{G}$ is called grammar joining
if

$G_1 \in \mathcal{G}(w_1) \land G_2 \in \mathcal{G}(w_2) \implies G_1 \oplus G_2 \in \mathcal{G}(w_1 w_2)$.

It would be convenient to use a grammar joining $\oplus$ and an
encoder $B : \mathcal{G} \to \mathbb{X}^+$ such that the edit distance between
$B(G_1 \oplus G_2)$ and $B(G_1) B(G_2)$ is small. Without making
the idea too precise, such joining and encoder will be called
adapted.

The following example of mutually adapted joining $\oplus$ and
encoder $B$ will be used in the subsequent sections. First, let
us introduce a useful notation.

Definition 3.2: For any function $f : \mathbb{U} \to \mathbb{W}^*$, where
concatenation on domains $\mathbb{U}^*$ and $\mathbb{W}^*$ is defined, denote its
extension onto strings as

$f^* : \mathbb{U}^* \ni x_1 x_2 ... x_m \mapsto f(x_1) f(x_2) ... f(x_m) \in \mathbb{W}^*$. (24)

Now for $G_i = (\alpha_{i1}, \alpha_{i2}, ..., \alpha_{i n_i})$, $i = 1, 2$, define

$G_1 \oplus G_2 := (A_2 A_{n_1+2}, H_1^*(\alpha_{11}), H_1^*(\alpha_{12}), ..., H_1^*(\alpha_{1 n_1}), H_2^*(\alpha_{21}), H_2^*(\alpha_{22}), ..., H_2^*(\alpha_{2 n_2}))$,

where $H_1(A_j) := A_{j+1}$ and $H_2(A_j) := A_{j+n_1+1}$ for
nonterminals and $H_1(x) := H_2(x) := x$ for terminals $x \in \mathbb{X}$.

In the next construction, the set of natural numbers $\mathbb{N}$ is
treated as a generic infinite countable alphabet with concatenation
$ab$, addition $a + b$, and subtraction $a - b$.

Definition 3.3: $B : \mathcal{G} \to \mathbb{Y}^+$ is a local grammar encoder if

$B(G) = B_S^*(B_N(G))$, (25)

where:
(i) the function $B_N : \mathcal{G} \to (\{0\} \cup \mathbb{N})^*$ encodes grammars
as strings of natural numbers so that the encoding of
a grammar $G = (\alpha_1, \alpha_2, ..., \alpha_n)$ is the string

$B_N(G) := F_1^*(\alpha_1) D_X F_2^*(\alpha_2) D_X ... D_X F_n^*(\alpha_n) (D_X + 1)$,

which employs relative indexing $F_i(A_j) := D_X + 1 + j - i$
for nonterminals and the identity transformation $F_i(x) := x$
for terminals $x \in \mathbb{X} = \{0, 1, ..., D_X - 1\}$,

(ii) $B_S$ is a function of form $B_S : \{0\} \cup \mathbb{N} \to \mathbb{Y}^*$ (for the
technical purposes of the next subsection, not necessarily an
injection), which we will call the natural number encoder.

Indeed, local encoders are adapted to the joining operation $\oplus$.
For instance, if $B(G_i) = u_i B_S(D_X + 1)$ for some grammars
$G_i$, $i = 1, 2$, then
$B(G_1 \oplus G_2) = B_S(D_X + 2) B_S(D_X + 2 + V[G_1]) B_S(D_X) u_1 B_S(D_X) u_2 B_S(D_X + 1)$.
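
The adaptedness of the joining and the number-string encoding can be checked on a toy case. The following is a minimal Python sketch, not the paper's implementation: the tuple representation of symbols, the helper names (`T`, `A`, `encode_numbers`, `join`, `expand`), and the choice $D_X = 2$ are assumptions made for illustration, and the vocabulary size $V[G]$ is taken to be the number of rules $n$, which is the reading consistent with the identity above. Since $B_S$ is applied symbol by symbol, an identity at the level of number strings carries over to $B$.

```python
D_X = 2  # assumed input alphabet {0, 1}


def T(x):
    """Terminal symbol x in {0, ..., D_X - 1}."""
    return ("t", x)


def A(j):
    """Nonterminal A_j with 1-based index j."""
    return ("A", j)


def encode_numbers(grammar, d_x=D_X):
    """Number-string encoding B_N: rules separated by D_X, terminated by D_X + 1.

    A nonterminal A_j in the i-th rule is written as D_X + 1 + j - i.
    """
    out = []
    for i, rule in enumerate(grammar, start=1):
        if i > 1:
            out.append(d_x)  # separator between consecutive rules
        for kind, value in rule:
            out.append(value if kind == "t" else d_x + 1 + value - i)
    out.append(d_x + 1)  # terminator
    return out


def join(g1, g2):
    """Grammar joining: new start rule A_2 A_{n1+2}, then shifted copies of g1, g2."""
    n1 = len(g1)

    def shift(rule, offset):
        return [(k, v + offset) if k == "A" else (k, v) for k, v in rule]

    return ([[A(2), A(n1 + 2)]]
            + [shift(rule, 1) for rule in g1]
            + [shift(rule, n1 + 1) for rule in g2])


def expand(grammar, index=1):
    """Terminal string produced by nonterminal A_index."""
    out = []
    for kind, value in grammar[index - 1]:
        out.extend([value] if kind == "t" else expand(grammar, value))
    return out
```

Because of the relative indexing, the encodings of the two joined grammars reappear unchanged inside the encoding of the joined grammar; only a three-number header and one separator are added.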

There exist many prefix-free local encoders. Obviously, the
set $B_N(\mathcal{G})$ is prefix-free itself. Therefore, the encoder (25) is
prefix-free (and uniquely decodable) if $B_S$ is also prefix-free,
i.e., if $B_S$ is an injection and the set $B_S(\{0\} \cup \mathbb{N})$ is prefix-free.

B. Encoder-induced grammar lengths 

Let us generalize the definition of the grammar length to
include the notion of a universal code length as a special case.

Definition 3.4: For a grammar encoder $B : \mathcal{G} \to \mathbb{Y}^+$, the
function $|B(\cdot)|$ will be called the $B$-induced grammar length.
For example, Yang-Kieffer length $|\cdot|$ is $B$-induced for a local
grammar encoder $B = B_S^*(B_N(\cdot))$, where

$B_S(x) = \lambda$ if $x \in \{D_X, D_X + 1\}$, and $B_S(x) = 0$ otherwise. (26)

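In the same spirit, we can extend the idea of the smallest
grammar with respect to the Yang-Kieffer length, discussed in
[6]. A subclass $\mathcal{J} \subset \mathcal{G}$ of admissible grammars will be called
sufficient if there exists a grammar transform $\Gamma : \mathbb{X}^+ \to \mathcal{J}$,
i.e., if $\mathcal{J} \cap \mathcal{G}(w) \neq \emptyset$ for all $w \in \mathbb{X}^+$. On the other hand,
a grammar transform $\Gamma$ will be called a $\mathcal{J}$-grammar transform
if $\Gamma(\mathbb{X}^+) \subset \mathcal{J}$.

Definition 3.5: For an arbitrary grammar length function
$\|\cdot\| : \mathcal{G} \to \{0\} \cup \mathbb{N}$, a $\mathcal{J}$-grammar transform $\Gamma$ will be called
a $(\|\cdot\|, \mathcal{J})$-minimal grammar transform if $\|\Gamma(w)\| \le \|G\|$ for
all $G \in \mathcal{G}(w) \cap \mathcal{J}$ and $w \in \mathbb{X}^+$.

Definition 3.6: The code $B(\Gamma(\cdot))$ will be called $(B, \mathcal{J})$-minimal
if $\Gamma$ is $(\|\cdot\|, \mathcal{J})$-minimal for the $B$-induced grammar
length $\|\cdot\|$.

Definition 3.7: For a grammar length $\|\cdot\|$, the grammar
subclasses $\mathcal{J}, \mathcal{K} \subset \mathcal{G}$ are called $\|\cdot\|$-equivalent if

$\min_{G \in \mathcal{G}(w) \cap \mathcal{J}} \|G\| = \min_{G \in \mathcal{G}(w) \cap \mathcal{K}} \|G\|$ for all $w \in \mathbb{X}^+$.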

C. Subclasses of grammars 

In Subsection III-E we will bound the excess lengths of
$(B, \mathcal{J})$-minimal codes, where $B$ are local encoders and $\mathcal{J}$ are
some sufficient subclasses. In Subsection III-D we will show
that several of these codes are universal. Prior to this, let us
introduce several subclasses of grammars $\mathcal{J} \subset \mathcal{G}$ for which:
(i) our results hold, (ii) the computation of $(B, \mathcal{J})$-minimal
codes may be easier than for $(B, \mathcal{G})$-minimal ones, and (iii)
the interpretation of grammars' vocabulary size as the number
of distinct words in a linguistic sense seems more plausible.

First, we will say that $(\alpha_1, \alpha_2, ..., \alpha_n)$ is a flat grammar
if $\alpha_i \in \mathbb{X}^+$ for $i > 1$. The set of flat grammars will be
denoted as $\mathcal{F}$. In particular, flat grammars were considered
in the computational linguistic experiment by [26]. Next,
symbol $\mathcal{D}_k \subset \mathcal{F}$ will denote the class of $k$-block interleaved
grammars, i.e., flat grammars $(\alpha_1, \alpha_2, ..., \alpha_n)$, where $\alpha_i \in \mathbb{X}^k$
for $i > 1$. As a further subclass, $\mathcal{B}_k \subset \mathcal{D}_k$ will stand for the
set of $k$-block grammars, i.e., $k$-block interleaved grammars
$(uw, \alpha_2, ..., \alpha_n)$, where string $u \in (\{A_2, A_3, ..., A_n\})^*$ contains
occurrences of all $A_2, A_3, ..., A_n$ and string $w \in \mathbb{X}^*$
has length $|w| < k$, cf. [49]. Of course, classes $\mathcal{B}_k$, $\mathcal{D}_k$,
$\mathcal{B} := \bigcup_{k \ge 1} \mathcal{B}_k$, $\mathcal{D} := \bigcup_{k \ge 1} \mathcal{D}_k$, and $\mathcal{F}$ are sufficient.

On the other hand, grammar $(\alpha_1, \alpha_2, ..., \alpha_n)$ is called
irreducible if

(i) each string $\alpha_i$ has a different expansion $\langle \alpha_i \rangle_G$ and
satisfies $|\alpha_i| > 1$,

(ii) each secondary nonterminal appears in string $\alpha_1 \alpha_2 ... \alpha_n$
at least twice,

(iii) each pair of consecutive symbols in strings $\alpha_1, \alpha_2, ..., \alpha_n$
appears at most once at nonoverlapping positions [5].

The set of irreducible grammars will be denoted as $\mathcal{I}$.

Class $\mathcal{I}$ is important in the theory of grammar-based
compression for two reasons. Firstly, any $\mathcal{I}$-grammar transform is
asymptotically compact [5] so it yields a universal code when
combined with the grammar encoder $B_{YK}$. Secondly, there is
an $\mathcal{I}$-grammar transform which is $(|\cdot|, \mathcal{G})$-minimal.

Theorem 3.8: The classes $\mathcal{I}$ and $\mathcal{G}$ are $|\cdot|$-equivalent.
Proof: Starting with any grammar $G_1 \in \mathcal{G}(w)$, a grammar
$G_2 \in \mathcal{I} \cap \mathcal{G}(w)$ can be constructed by applying a
sequence of certain reduction rules until the local minimum
of functional $2|\cdot| - V[\cdot]$ is achieved [5]. In fact, the only
reduction applicable to a grammar that minimizes $|\cdot|$ is the
introduction of a new nonterminal denoting a pair of symbols
which appears exactly twice on the right-hand side of the
grammar, cf. Section VI in [5]. This reduction conserves the
Yang-Kieffer length. ■

D. Universal codes for local encoders 

The local encoders in our sense resemble the encoder $B_{NS}$
considered by Neuhoff and Shields [49] as an encoder for the
class of block grammars $\mathcal{B}$. The authors have established that
any $(B_{NS}, \mathcal{B})$-minimal code is universal. The main difference
between the encoder $B_{NS}$ and a local encoder is that $B_{NS}$
encodes a nonterminal $A_i$ as a string of length
$\lfloor \log_{D_Y} V[G] \rfloor + 1$ whereas the local encoder uses a string
of length $|B_S(D_X + i)|$. This is not a big difference so we can
easily prove the following proposition using some results of [49].

Theorem 3.9: Let $B_S$ be such a prefix-free natural number
encoder that $|B_S(\cdot)|$ is growing and asymptotically optimal,
i.e.,

$\limsup_{n \to \infty} |B_S(n)| / \log_{D_Y} n = 1$. (27)

For any sufficient subclass of grammars $\mathcal{J} \supset \mathcal{B}$, every
$(B_S^*(B_N(\cdot)), \mathcal{J})$-minimal code $C$ is strongly universal, i.e.,

$\lim_{n \to \infty} \mathbf{E} |C(X_{1:n})| \log D_Y / n = h$ (28)

and

$\limsup_{n \to \infty} |C(X_{1:n})| \log D_Y / n \le h$ a.s. (29)

for every stationary ergodic process $(X_k)_{k \in \mathbb{Z}}$.
Remark: Claims (28) and (29) can be generalized to stationary
nonergodic processes as follows. Firstly, the strong ergodic
decomposition theorem [53, a statement in the proof of Theorem
9.12] and inequality (29) imply

$\limsup_{n \to \infty} |C(X_{1:n})| \log D_Y / n \le h_F$ a.s. (30)

for any stationary process $(X_k)_{k \in \mathbb{Z}}$, where $h_F$ is the entropy
rate of the process's random ergodic measure, viz. (51) and
(50). Since $0 \le |C(X_{1:n})| \le Kn$ for a $K > 0$, inequality
(30) implies equality (28) for any stationary process $(X_k)_{k \in \mathbb{Z}}$
by formula (53) and the inverse Fatou lemma, cf. [54].
Consequently, the generalized (28) and (30) imply that we
have in fact equality

$\limsup_{n \to \infty} |C(X_{1:n})| \log D_Y / n = h_F$ a.s. (31)

Proof: Consider a sequence of $\mathcal{B}_k$-grammar transforms
$\Gamma_k$. For an $\epsilon > 0$ and a stationary ergodic process $(X_k)_{k \in \mathbb{Z}}$
with entropy rate $h$, let $k(n)$ be the largest integer $k$ satisfying

r loggy V[rfc(n)(w)] ^ , , „ 
limsup max -— ^ < n, + 2e, 

lim EV[rfe(„)(Xi:„)]-fc(n)/n = 0, 

n— >oo 

lim V[rfc(„)(Xi:„)] • k{n)/n = a.s., cf. [49]. 

n — >oo 

Since Imin k{n) = cx), a ^7) -minimal code is universal if 

\B{Tk{w))\<ak\[Tk{w)]+-f{k)-\og^^y[Tk{w)], 

where $\alpha > 0$ and $\lim_k \gamma(k) = 1$. In particular, this inequality
holds for (25), (27), and growing $|B_S(\cdot)|$. ■

The prefix-free natural number encoder satisfying (27)
can be chosen, e.g., as the Elias $D_Y$-ary representation
$\omega : \{0\} \cup \mathbb{N} \to \mathbb{Y}^*$ [55], $|\omega(n)| = \ell(n)$, where



l{n) := 



if n < Dy, 
1 \fn>DY. 
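
To make the requirements on the natural number encoder concrete, the sketch below uses the binary Elias delta code as a stand-in. This is an assumption for illustration: it is not the Elias $D_Y$-ary representation $\omega$ named above, it fixes $D_Y = 2$, and the function names are hypothetical. The argument is shifted by one so that 0 can be encoded as well. The code is prefix-free, its length is nondecreasing in $n$, and the ratio of its length to $\log_2 n$ tends to 1, which is the behavior demanded by (27).

```python
def elias_delta(n):
    """Binary Elias delta code of an integer n >= 1, as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("Elias delta code is defined for n >= 1")
    nbits = n.bit_length()        # floor(log2 n) + 1
    lbits = nbits.bit_length()    # number of bits needed to write nbits
    return (
        "0" * (lbits - 1)         # unary prefix announcing the length of nbits
        + format(nbits, "b")      # nbits in binary, leading bit is 1
        + format(n, "b")[1:]      # n in binary without its leading 1
    )


def natural_number_code(n):
    """Stand-in for B_S on {0} U N: shift by one so that 0 is encodable."""
    return elias_delta(n + 1)


def code_length_ratio(n):
    """|code(n)| divided by log2(n), computed for n >= 2 via the bit length."""
    return len(elias_delta(n)) / (n.bit_length() - 1)
```

For example, `code_length_ratio(2**1000)` is about 1.02, and the ratio keeps decreasing toward 1 for larger arguments.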



E. Bounds for the vocabulary size 

Let us derive the analogue of Theorem 2.1 for some minimal
grammar-based codes that use the local grammar encoders.
Firstly, the code lengths are almost subadditive. Secondly,
the excess code lengths are dominated by the vocabulary
size multiplied by the length of the longest repeat. The code
universality is irrelevant for the proofs.

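Definition 3.10: Consider a grammar

$G = (\alpha_1, \alpha_2, ..., \alpha_n) \in \mathcal{G}(w)$.

For $0 < p, q < |w|$ and $p + q = |w|$, let $u, v \in \mathbb{X}^*$ be the
strings such that $p = |u|$, $q = |v|$ and $uv = w$. Then define
the left and right croppings of $G$ as

$\mathbb{L}_p G := (x_L y_L, \alpha_2, ..., \alpha_n) \in \mathcal{G}(u)$,
$\mathbb{R}_q G := (y_R x_R, \alpha_2, ..., \alpha_n) \in \mathcal{G}(v)$,

where exactly one of the following conditions holds:

(i) $\alpha_1 = x_L x_R$ and $y_L y_R = \lambda$,

(ii) $\alpha_1 = x_L A_i x_R$ for some nonterminal $A_i$, $2 \le i \le n$, with
expansion $\langle A_i \rangle_G = y_L y_R$.

Moreover, define the flattening

$\mathbb{F} G := (\alpha_1, \langle \alpha_2 \rangle_G, \langle \alpha_3 \rangle_G, ..., \langle \alpha_n \rangle_G)$

and the secondary part

$\mathbb{S} G := (\lambda, \alpha_2, \alpha_3, ..., \alpha_n)$.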
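
The flattening and the secondary part can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple representation of symbols and the function names (`expansion`, `flatten`, `secondary_part`, `vocabulary_size`) are hypothetical, and the vocabulary size is taken to be the number of rules $n$.

```python
def T(x):
    """Terminal symbol."""
    return ("t", x)


def A(j):
    """Nonterminal A_j with 1-based index j."""
    return ("A", j)


def expansion(grammar, symbols):
    """Terminal string obtained by fully expanding a string of symbols."""
    out = []
    for kind, value in symbols:
        if kind == "t":
            out.append(value)
        else:
            out.extend(expansion(grammar, grammar[value - 1]))
    return out


def flatten(grammar):
    """Flattening: keep the start rule, replace each secondary rule by its expansion."""
    return [list(grammar[0])] + [
        [T(x) for x in expansion(grammar, rule)] for rule in grammar[1:]
    ]


def secondary_part(grammar):
    """Secondary part: the start rule is replaced by the empty string."""
    return [[]] + [list(rule) for rule in grammar[1:]]


def vocabulary_size(grammar):
    """Number of rules n (assumed reading of V[G])."""
    return len(grammar)
```

Flattening leaves both the represented string and the vocabulary size unchanged, while every secondary rule of the result consists of terminals only, which is what makes the result a flat grammar.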

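Theorem 3.11: Let $B$ be the local encoder (25). Introduce
constants

$W_m := \max_{0 \le n \le D_X + 2 + m} |B_S(n)|$. (32)

Let $\Gamma$ be a $(\|\cdot\|, \mathcal{J})$-minimal grammar transform for the
$B$-induced grammar length $\|\cdot\|$. Consider the code $C = B(\Gamma(\cdot))$,
strings $w, u, v \in \mathbb{X}^+$, and a grammar class $\mathcal{K}$ which is
$\|\cdot\|$-equivalent to $\mathcal{J}$.

(i) If $G_1, G_2 \in \mathcal{J} \implies G_1 \oplus G_2 \in \mathcal{K}$ then

$|C(u)| + |C(v)| - |C(uv)| \ge -3W_0 - W_{V[\Gamma(u)]}$. (33)

(ii) If $G \in \mathcal{J} \implies \mathbb{L}_n G, \mathbb{R}_n G \in \mathcal{K}$ for all valid $n$ then

$|C(u)|, |C(v)| \le |C(uv)| + W_0 L(uv)$, (34)
$|C(u)| + |C(v)| - |C(uv)| \le \|\mathbb{S}\Gamma(uv)\| + W_0 L(uv)$. (35)

(iii) If $G \in \mathcal{J} \implies \mathbb{F} G \in \mathcal{K}$ then

$\|\mathbb{S}\Gamma(w)\| + W_0 L(w) \le W_0 V[\Gamma(w)] (1 + L(w))$. (36)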

Remark: In particular, (|33] ) holds for J = Q,T while inequal- 
ities (O-dS hold for = g,I,T,V,Vk. Moreover, (O 
and ( |36] | together imply bound 

$|C(u)| + |C(v)| - |C(uv)| \le W_0 V[\Gamma(uv)] (1 + L(uv))$, (37)
which generalizes the inequality (21).

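(i) The result is implied by $\|\Gamma(uv)\| \le \|\Gamma(u) \oplus \Gamma(v)\|$ and

$\|G_1 \oplus G_2\| \le \|G_1\| + \|G_2\| + |B_S(D_X + 2 + V[G_1])| + 3W_0$,

where $G_1 = \Gamma(u)$ and $G_2 = \Gamma(v)$.

(ii) Set $p = |u|$, $q = |v|$, and $w = uv$. The inequalities follow
from

$\|\Gamma(w)\| + W_0 L(w) \ge \|\mathbb{L}_p \Gamma(w)\| \ge \|\Gamma(u)\|$,

$\|\Gamma(w)\| + W_0 L(w) \ge \|\mathbb{R}_q \Gamma(w)\| \ge \|\Gamma(v)\|$,

and

$\|\mathbb{L}_p \Gamma(w)\| + \|\mathbb{R}_q \Gamma(w)\| \le \|\Gamma(w)\| + \|\mathbb{S}\Gamma(w)\| + W_0 L(w)$.

(iii) The claim is entailed by $\|\mathbb{S}\Gamma(w)\| \le \|\mathbb{S}\mathbb{F}\Gamma(w)\|$ and

$\|\mathbb{S}\mathbb{F}\Gamma(w)\| \le W_0 (V[\Gamma(w)] - 1)(1 + L(w)) + W_0$.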



Although Theorems 2.1 and 3.11 are analogous, there is
a huge qualitative difference between the codes based on
irreducible grammars that apply the Yang-Kieffer encoder $B_{YK}$
and the universal grammar-based codes that minimize the
length induced by the local encoder. The vocabulary size of the
former codes is also lower-bounded by the square root of the
code length. Thus these codes appear to see more structure
in IID strings than in data that exhibit a sequential order.
In contrast, the newly considered codes discover much less
structure in the IID case, cf. Appendix IV and the experiment
[15].

IV. Stationary processes 

In this section we explore stationary processes rather than 
codes. The goal is to prove equality (fTTl i and inequality (fTSl i 
for strongly nonergodic processes over a finite alphabet. The 
proofs will be given in Subsection IV-C. In the preparatory
Subsection IV-A we shall discuss some elementary algebraic
identities satisfied by excess entropy $E = \lim_n E(n)$, the
limit of the $n$-symbol excess entropies $E(n)$. This is followed
by an analysis of the ergodic decomposition of $E$ and $E(n)$
in Subsection IV-B, which also provides some necessary
material. A by-product of this decomposition is the proposition
that the excess code lengths $E^C(n)$ are unbounded for all but
countably many ergodic processes, proved in Appendix III.

A. The limit of the n-symbol excess entropies 

Let the alphabet $\mathbb{X}$ be a generic countable set again.
Consider the sequence of the $n$-symbol excess entropies
$E(n) = I(X_{-n+1:0}; X_{1:n})$ defined in (5). Since $E(n)$ cannot decrease
with growing $n$, we may define the limiting value

$E = \sup_{n} E(n) = \lim_{n \to \infty} E(n)$, (38)

called simply excess entropy [38]. Although less attention was
paid in information theory to $E$ than to entropy rate (6), excess
entropy satisfies a number of neat identities.

Denote the difference operator as $\Delta f(n) = f(n) - f(n-1)$
and assume $H(X_i) < \infty$. The first two differences of block
entropy $H(n) := H(X_{1:n})$, with $H(0) := 0$, are conditional
entropy $\Delta H(n) = H(X_n | X_{1:n-1})$ and minus conditional
mutual information $\Delta^2 H(n) = -I(X_1; X_n | X_{2:n-1})$. Observe
that $H(n), \Delta H(n), -\Delta^2 H(n) \in [0, \infty[$, whereas the entropy
rate (6) satisfies equality $h = \lim_n \Delta H(n)$ for any countable
alphabet (only the finite alphabet case was considered e.g. in
[56, Section 2.9]). Hence, as it was derived in [38], we have

$E = \lim_{n \to \infty} [H(n) - n \Delta H(n)] = \lim_{n \to \infty} [H(n) - nh]$ (39)

since

$H(n) - n \Delta H(n) = -\sum_{k=2}^{n} (k - 1) \Delta^2 H(k)$, (40)
$E(n) = -\sum_{k=2}^{2n} c_n(k) \Delta^2 H(k)$, (41)

where

$c_n(k) := k - 1$ for $2 \le k \le n$, and $c_n(k) := 2n - k + 1$ for $n + 1 \le k \le 2n$. (42)



In view of (39), excess entropy equals the nonnegative
deviation of block entropy from the asymptotic linear growth.
Since $I(X_1; X_n | X_{2:n-1}) = 0$ for $n > k + 1$ for a $k$-th order
Markov process, $E$ is finite in this case. Moreover, excess
entropy is finite for finite-state sources, a.k.a. hidden
Markov processes [57], [38], by the data-processing inequality
$I(X_{1:n}; X_{n+1:2n}) \le I(X_{1:n}; Y_n) \le \sup_m H(Y_m) < \infty$,
where $Y_n$ is the hidden state at time $n$.⁴ Whereas finite-state
sources are state-of-the-art models in many applications,
including computational linguistics [60], [25], Hilberg's
observation of an empirical power law $E(n) \propto \sqrt{n}$ indicates that
a larger class of models may be worth considering.
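
The finiteness of excess entropy in the Markov case can be checked numerically. The sketch below is an illustration under assumptions: the transition matrix, its stationary distribution, and the function names are hypothetical choices, entropies are in nats, and $E(n)$ is computed as $2H(n) - H(2n)$, which equals $I(X_{-n+1:0}; X_{1:n})$ by stationarity. For a stationary first-order Markov chain, block entropy is exactly linear from $n = 1$ on, so $E(n) = H(1) - h$ for every $n \ge 1$.

```python
import itertools
import math

# Hypothetical two-state chain; rows are conditional distributions P(next | current).
P = [[0.9, 0.1],
     [0.4, 0.6]]
PI = [0.8, 0.2]  # stationary distribution of P (PI * P == PI)


def block_entropy(n):
    """H(n) = H(X_{1:n}) in nats, by enumerating all binary blocks of length n."""
    if n == 0:
        return 0.0
    total = 0.0
    for block in itertools.product(range(2), repeat=n):
        prob = PI[block[0]]
        for a, b in zip(block, block[1:]):
            prob *= P[a][b]
        if prob > 0:
            total -= prob * math.log(prob)
    return total


def excess_entropy(n):
    """n-symbol excess entropy E(n) = 2 H(n) - H(2n) for a stationary process."""
    return 2 * block_entropy(n) - block_entropy(2 * n)


# Entropy rate h = H(X_2 | X_1) of the chain, in nats.
ENTROPY_RATE = -sum(
    PI[a] * P[a][b] * math.log(P[a][b]) for a in range(2) for b in range(2)
)
```

Comparing `excess_entropy(n)` with `block_entropy(1) - ENTROPY_RATE` for a few small `n` shows a constant sequence, in agreement with (39).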

B. Ergodic decomposition of excess entropy 

In this subsection we will discuss a representation of excess 
entropy that pertains to the ergodic decomposition of a sta- 
tionary process. This result will be utilized in the following 
subsection, which concerns strongly nonergodic processes. We 
shall take for granted many facts that were mentioned and 
derived elsewhere. The measure-theoretic generalization of 
conditional mutual information [61], [62], [63], [3] is a tool 
that we need to recall in the very beginning. 

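For probability space $(\Omega, \mathcal{J}, P)$, a partition of the algebra
$\mathcal{J} \subset 2^\Omega$, being the domain of probability measure
$P : \mathcal{J} \to \mathbb{R}$, is a finite set of events $\{B_j\}_{j=1}^{J} \subset \mathcal{J}$ such that
$B_i \cap B_j = \emptyset$ for $i \neq j$ and $\bigcup_{j=1}^{J} B_j = \Omega$. Just like for discrete
variables, define mutual information between partitions
$\alpha = \{A_i\}_{i=1}^{I}$ and $\beta = \{B_j\}_{j=1}^{J}$ with respect to probability measure $P$ as

$I_P(\alpha; \beta) := \sum_{i=1}^{I} \sum_{j=1}^{J} P(A_i \cap B_j) \log \frac{P(A_i \cap B_j)}{P(A_i) P(B_j)}$, (43)

where $0 \log 0/x := 0$.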

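Let $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ be the subalgebras of algebra $\mathcal{J}$. That
is, $\{\emptyset, \Omega\} \subset \mathcal{A}, \mathcal{B}, \mathcal{C} \subset \mathcal{J}$ as well as $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{J}$
are closed under the operations $\cap$, $\cup$, and $\setminus$. Moreover let the
random variable $P(A \| \mathcal{C})$ be the conditional probability of
event $A \in \mathcal{J}$ w.r.t. the smallest $\sigma$-field containing $\mathcal{C}$ [64,
Section 33]. We may extend the concepts of conditional
mutual information, mutual information, conditional entropy,
and entropy respectively as

$I(\mathcal{A}; \mathcal{B} | \mathcal{C}) := \sup_{\alpha \subset \mathcal{A}, \beta \subset \mathcal{B}} \mathbf{E}\, I_{P(\cdot \| \mathcal{C})}(\alpha; \beta)$, (44)
$I(\mathcal{A}; \mathcal{B}) := I(\mathcal{A}; \mathcal{B} | \{\emptyset, \Omega\})$, (45)
$H(\mathcal{A} | \mathcal{C}) := I(\mathcal{A}; \mathcal{A} | \mathcal{C})$, (46)
$H(\mathcal{A}) := I(\mathcal{A}; \mathcal{A})$, (47)

cf. [3], [62], [63], [65, Section 12]. These concepts generalize
the definitions for random variables in a natural way. If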

⁴ By a similar reasoning, excess entropy is finite also for Gaussian
ARMA processes. Some disguised expressions for the excess entropy of Gaussian
processes were evaluated in [58, Section 5.5], [59].
we consider discrete random variables $Y_i$ and the smallest
subalgebras $\mathcal{A}_i \subset \mathcal{J}$ such that all events of the form $(Y_i = y_i)$
belong to $\mathcal{A}_i$, then $I(Y_1; Y_2 | Y_3) = I(\mathcal{A}_1; \mathcal{A}_2 | \mathcal{A}_3)$,
$I(Y_1; Y_2) = I(\mathcal{A}_1; \mathcal{A}_2)$, $H(Y_1 | Y_3) = H(\mathcal{A}_1 | \mathcal{A}_3)$, and $H(Y_1) = H(\mathcal{A}_1)$.
Quantities (44)-(47) also satisfy the additivity relation (chain
rules) and enjoy certain continuity [62], [63], [3, Theorems 1
and 2].

Consider process $(X_k)_{k \in \mathbb{Z}}$ again, where variables
$X_i : (\Omega, \mathcal{J}) \to (\mathbb{X}, \mathcal{X})$ provide a mapping between measurable
spaces, $\mathbb{X}$ being countable. The completion $\sigma(\mathcal{A})$ of algebra $\mathcal{A}$
is the smallest $\sigma$-algebra that contains both all elements of $\mathcal{A}$
and all elements of $2^\Omega$ of outer $P$-measure 0. The completions
of the two subalgebras that contain events $(X_i = x_i)$ for
all $i \le 0$ and for all $i \ge 1$ respectively will be denoted as
$\mathcal{G}_{\le 0}, \mathcal{G}_{\ge 1} \subset \mathcal{J}$. Then, by continuity of mutual information
[62], [63, Section 2.2],

$E = I(\mathcal{G}_{\le 0}; \mathcal{G}_{\ge 1})$. (48)

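On the other hand, by the chain rule for conditional mutual
information [63, Section 3.6], [3, Theorem 2(ii)], we have

$I(\mathcal{G}_{\le 0}; \mathcal{G}_{\ge 1}) = I(\mathcal{G}_{\le 0}; \mathcal{F}) + I(\mathcal{G}_{\le 0}; \mathcal{G}_{\ge 1} | \mathcal{F}) = H(\mathcal{F}) + I(\mathcal{G}_{\le 0}; \mathcal{G}_{\ge 1} | \mathcal{F})$ (49)

for any algebra $\mathcal{F} \subset \mathcal{G}_{\le 0} \cap \mathcal{G}_{\ge 1}$.

In fact, we may choose $\mathcal{F}$ to be the preimage of the
process's shift-invariant algebra [3, Lemma 3]. To be precise,
denote the product measurable space of doubly infinite sequences
$(\mathbb{U}, \mathcal{U}) = \times_{k \in \mathbb{Z}} (\mathbb{X}, \mathcal{X})$. For the shift transformation
$T : \mathbb{U} \ni (x_k)_{k \in \mathbb{Z}} \mapsto (x_{k+1})_{k \in \mathbb{Z}} \in \mathbb{U}$, where $x_k \in \mathbb{X}$, define
the invariant algebra $\mathcal{I} := \{A \in \mathcal{U} : TA = A\}$. Now let

$\mathcal{F} = ((X_k)_{k \in \mathbb{Z}})^{-1}(\mathcal{I}) \subset \mathcal{J}$.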

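For this algebra, conditional information $I(\mathcal{G}_{\le 0}; \mathcal{G}_{\ge 1} | \mathcal{F})$
can be interpreted in an interesting way. Let $(\mathbb{S}, \mathcal{S})$ be the
measurable space of stationary probability measures on $(\mathbb{U}, \mathcal{U})$
(i.e., $\mu \circ T = \mu$ for $\mu \in \mathbb{S}$) and let $(\mathbb{E}, \mathcal{E}) \subset (\mathbb{S}, \mathcal{S})$ be the
subspace of ergodic measures (i.e., $\mu(A) \in \{0, 1\}$ for $\mu \in \mathbb{E}$
and $A \in \mathcal{I}$). Precisely, $\mathcal{S}$ and $\mathcal{E}$ are defined as the smallest
$\sigma$-fields containing all cylinder sets $\{\mu \in \mathbb{S} : \mu(A) \le r\}$ and
$\{\mu \in \mathbb{E} : \mu(A) \le r\}$, $A \in \mathcal{U}$, $r \in \mathbb{R}$, respectively. Since $\mathcal{U}$
is countably generated, all respective singletons $\{\mu\}$ belong
to $\mathcal{S}$ and $\mathcal{E}$. According to the strong ergodic decomposition
theorem [53, Theorems 9.10-12] there exists an almost surely
unique random variable $F : (\Omega, \mathcal{F}) \to (\mathbb{E}, \mathcal{E})$ such that

$F(A) = P((X_i)_{i \in \mathbb{Z}} \in A \| \mathcal{F})$, (50)

for all $A \in \mathcal{U}$. The variable $F$ will be called here the random
ergodic measure of $(X_i)_{i \in \mathbb{Z}}$. For every stationary process, the
distribution $\nu(W) := P(F \in W)$, $W \in \mathcal{E}$, is given uniquely.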

In the following it is convenient to consider information
measures for the process $(X_k)_{k \in \mathbb{Z}}$ as functions of the process
distribution. For an arbitrary distribution
$\mu = P((X_k)_{k \in \mathbb{Z}} \in \cdot) \in \mathbb{S}$, we will consider the following parameterization:

$H_\mu(n) := H(n)$, $h_\mu := h$, (51)
$E_\mu(n) := E(n)$, $E_\mu := E$. (52)

Plugging $F$ for $\mu$ and using equality $\mathbf{E} F(A) = P((X_i)_{i \in \mathbb{Z}} \in A)$,
we obtain:

Theorem 4.1: For a stationary process over an alphabet $\mathbb{X}$,

$h = \mathbf{E} h_F$ if $\mathbb{X}$ is finite, (53)
$E = H(\mathcal{F}) + \mathbf{E} E_F$ if $\mathbb{X}$ is countable. (54)

Equality (54) follows from (49), whereas decomposition (53)
for the entropy rate was derived by [66]. The exact proof that
$\mathbf{E} E_F = I(\mathcal{G}_{\le 0}; \mathcal{G}_{\ge 1} | \mathcal{F})$ can be found in [3, Theorem 5].

There are also finite-dimensional analogues of (53) and
(54), which will be applied in the following section. To write
them down, define the triple mutual information (TMI) as

$I(X; Y; Z) := I(X; Z) + I(Y; Z) - I(X, Y; Z)$ (55)

in the case of finite mutual information $I(X; Z)$, $I(Y; Z)$, and
$I(X, Y; Z)$. If entropies $H(X)$, $H(Y)$, and $H(Z)$ are finite
then the value of $I(X; Y; Z)$ does not depend on the argument
permutations [56]. Anyway, we cannot use a construction
analogous to (44) to extend the TMI to a permutation-independent
function of arbitrary fields, since $I(X; Y; Z)$ is not necessarily
positive and monotonic.

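Theorem 4.2: For a stationary process over a countable
alphabet,

$H(n) = I(X_{1:n}; \mathcal{F}) + \mathbf{E} H_F(n)$, (56)
$E(n) = I(X_{1:n}; X_{n+1:2n}; \mathcal{F}) + \mathbf{E} E_F(n)$, (57)

where the second formula holds if $H(n) < \infty$. The limiting
behavior of the appearing quantities is as follows:

(i) In general,

$\lim_{n \to \infty} I(X_{1:n}; \mathcal{F}) = H(\mathcal{F})$, (58)
$\lim_{n \to \infty} \mathbf{E} E_F(n) = \mathbf{E} E_F$. (59)

(ii) If the alphabet $\mathbb{X}$ is finite,

$\lim_{n \to \infty} \mathbf{E} H_F(n)/n = h$ with $\mathbf{E} H_F(n) \ge hn$, (60)
$\lim_{n \to \infty} I(X_{1:n}; X_{n+1:2n}; \mathcal{F})/n = \lim_{n \to \infty} I(X_{1:n}; \mathcal{F})/n = 0$. (61)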

Proof: We have $H(n) = H(X_{1:n})$ and $\mathbf{E} H_F(n) = H(X_{1:n} | \mathcal{F})$.
Hence (56) follows by the chain rule
$H(\mathcal{A}) = I(\mathcal{A}; \mathcal{B}) + H(\mathcal{A} | \mathcal{B})$ [63, Section 3.6], [3, Theorem 2(ii)]. In
the following, (56) implies (57).

The convergence $\lim_n I(X_{1:n}; \mathcal{F}) = H(\mathcal{F})$ can be established
by continuity of (conditional) mutual information
[63, Section 2.2], [3, Theorems 1(v) and 2(i)]. The same holds for
$\lim_n \mathbf{E} E_F(n) = \mathbf{E} E_F$ since $\mathbf{E} E_F(n) = I(X_{1:n}; X_{n+1:2n} | \mathcal{F})$ and
$\mathbf{E} E_F = I(\mathcal{G}_{\le 0}; \mathcal{G}_{\ge 1} | \mathcal{F})$.

The equality and inequality in (60) can be obtained from
(53) and the general property

$h_F = \inf_n H_F(n)/n = \lim_n H_F(n)/n$

via the dominated convergence theorem. Equalities (61) follow
from (56), (60), and the definition of the entropy rate (6). ■
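
A simple nonergodic example may help to see the decomposition at work. The sketch below is illustrative only: the equal mixture of two IID Bernoulli processes, its parameters, and the function names are assumptions. Here the random ergodic measure takes two values with probability 1/2 each, so $H(\mathcal{F}) = \log 2$; the components are IID, so their excess entropies vanish, and the $n$-symbol excess entropy of the mixture increases to $\log 2$.

```python
import math

THETAS = (0.2, 0.8)  # hypothetical Bernoulli parameters of the two ergodic components


def eta(p):
    """Binary entropy in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def block_entropy(n):
    """H(n) of the mixture; a block's probability depends only on its number of ones."""
    total = 0.0
    for k in range(n + 1):
        prob = sum(0.5 * t ** k * (1 - t) ** (n - k) for t in THETAS)
        total -= math.comb(n, k) * prob * math.log(prob)
    return total


def expected_component_entropy(n):
    """E H_F(n): average block entropy of the IID components."""
    return n * sum(0.5 * eta(t) for t in THETAS)


def information_about_component(n):
    """I(X_{1:n}; F), obtained from the decomposition of block entropy."""
    return block_entropy(n) - expected_component_entropy(n)


def excess_entropy(n):
    """E(n) = 2 H(n) - H(2n) for the stationary mixture."""
    return 2 * block_entropy(n) - block_entropy(2 * n)
```

Since the expected component entropy is linear in `n`, `excess_entropy(n)` coincides with `2 * information_about_component(n) - information_about_component(2 * n)`, which is the triple mutual information term of the decomposition of $E(n)$ when the components have zero excess entropy.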
C. Strongly nonergodic processes

According to Theorem 4.1, excess entropy $E$ is infinite if
either $\mathbf{E} E_F$ or $H(\mathcal{F})$ is infinite. In fact, there exist ergodic
processes with $E = \mathbf{E} E_F = \infty$ even for the binary alphabet
$\mathbb{X} = \{0, 1\}$ [67]. Necessarily, entropy $H(\mathcal{F})$ equals zero if and
only if the process is ergodic [3, Theorem 8(i) and Lemma 3].
On the other hand, equality $H(\mathcal{F}) = \infty$ holds in particular if
the completion of $\mathcal{F}$, denoted by $\sigma(\mathcal{F})$, contains a nonatomic
sub-$\sigma$-field [3, Theorem 7]. The latter case corresponds to the
class of strongly nonergodic processes, defined in Section I.

Theorem 4.3: A stationary process $(X_i)_{i \in \mathbb{Z}}$ is strongly
nonergodic if and only if $\sigma(\mathcal{F})$ contains a nonatomic sub-$\sigma$-field.
Moreover, for a strongly nonergodic process, all events
$(Z_k = 0)$ and $(Z_k = 1)$ belong to $\sigma(\mathcal{F})$ [3, Theorem 9].

Whereas $E = \infty$ for strongly nonergodic processes in
general, the $n$-symbol excess entropy $E(n)$ can be bounded by
the number of bits $Z_k$ that are individually predictable given
$X_{1:n}$ with sufficiently high probability.

Theorem 4.4: For a strongly nonergodic process $(X_i)_{i \in \mathbb{Z}}$
we have

$I(X_{1:n}; (Z_k)_{k \in \mathbb{N}}) \ge [\log 2 - \eta(\delta)] \cdot \mathrm{card}\, U_\delta(n)$, (62)

where $\delta \in (1/2, 1)$ and $U_\delta(n)$ is defined in equation (3).

Proof: By the continuity of mutual information [63,
Section 2.2], [3, Theorems 1(v) and 2(i)] and the chain rule,

$I(X_{1:n}; (Z_k)_{k \in \mathbb{N}}) = \lim_{k \to \infty} I(X_{1:n}; Z_{1:k}) = \sum_{k=1}^{\infty} I(X_{1:n}; Z_k | Z_{1:k-1})$. (63)

On the other hand,

$I(X_{1:n}; Z_k | Z_{1:k-1}) = H(Z_k | Z_{1:k-1}) - H(Z_k | X_{1:n}, Z_{1:k-1})$
$\ge \log 2 - H(Z_k | s_k(X_{1:n}))$
$\ge \log 2 - \eta(P(s_k(X_{1:n}) = Z_k))$

by the Fano inequality $H(Y_1 | Y_2) \le \eta(P(Y_1 = Y_2))$ for
a binary variable $Y_1$ [56, Theorem 2.47]. Restricting the
summation in (63) to $k \in U_\delta(n)$ yields the claim. ■
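
The Fano step can be checked numerically for a single binary fact. In the sketch below, the joint distributions, the function names, and the use of nats are illustrative assumptions, and $\eta$ is taken to be the binary entropy function. The first test case is a symmetric guess that is right with probability $\delta$, for which the Fano bound holds with equality; the second is an asymmetric guess with the same success probability.

```python
import math


def eta(p):
    """Binary entropy in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def conditional_entropy(joint):
    """H(Z | Y) in nats for a joint distribution given as joint[z][y]."""
    total = 0.0
    for y in range(len(joint[0])):
        p_y = sum(row[y] for row in joint)
        for row in joint:
            if row[y] > 0:
                total -= row[y] * math.log(row[y] / p_y)
    return total


def agreement_probability(joint):
    """P(Z = Y) for binary Z and Y."""
    return joint[0][0] + joint[1][1]


def information_lower_bound(delta, count):
    """Lower bound [log 2 - eta(delta)] * count on the information about the facts."""
    return (math.log(2) - eta(delta)) * count
```

With $\delta = 0.9$, each predictable fact contributes at least $\log 2 - \eta(0.9) \approx 0.368$ nats, roughly half a bit, to the mutual information.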
A corollary of this result is the inequality $H(n) \ge H^U(n)$
satisfied for the pseudoentropy $H^U(n)$ defined in (7) for the alphabet
$\mathbb{X}$ being finite. The derivation is as follows. Since all events
$(Z_k = 0)$ and $(Z_k = 1)$ belong to the completion of algebra
$\mathcal{F}$ by Theorem 4.3, we have $I(X_{1:n}; \mathcal{F}) \ge I(X_{1:n}; (Z_k)_{k \in \mathbb{N}})$
by the data processing inequality [3, Theorem 1(iv)]. In the
following, by Theorems 4.2 and 4.4 we obtain

$H(n) \ge hn + I(X_{1:n}; \mathcal{F}) \ge hn + I(X_{1:n}; (Z_k)_{k \in \mathbb{N}}) \ge H^U(n)$, (64)

where $\lim_n H^U(n)/n = h$.

V. The main result 

Thanks to the intermediate results of the preceding sections, 
we may prove the main theorem of this article. The proposition 
bounds the vocabulary size of a minimal grammar-based com- 
pression in terms of the number of elementary facts predictable 
from the compressed string, provided the string was sampled 



from a finite-energy strongly nonergodic process. This theorem 
can be called a formalization of the thesis (I). 

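Theorem 5.1: Let $B : \mathcal{G} \to \mathbb{Y}^+$ be the local grammar
encoder (25) for the output alphabet $\mathbb{Y} = \{0, 1, ..., D_Y - 1\}$
and such a prefix-free natural number encoder $B_S$ that $|B_S(\cdot)|$
is a growing function and

$\limsup_{n \to \infty} |B_S(n)| / \log_{D_Y} n = 1$.

Consider also a sufficient subclass of admissible grammars $\mathcal{J}$
that contains all block grammars, i.e., $\mathcal{J} \supset \mathcal{B}$, and satisfies:

(i) $G \in \mathcal{J} \implies \mathbb{F} G \in \mathcal{J}$ and

(ii) $G \in \mathcal{J} \implies \mathbb{L}_n G, \mathbb{R}_n G \in \mathcal{J}$ for all valid $n$,

where operations $\mathbb{F}$, $\mathbb{L}_n$, and $\mathbb{R}_n$ are given in Definition 3.10.
On the other hand, let $(X_i)_{i \in \mathbb{Z}}$ be a stationary finite-energy
strongly nonergodic process over the input alphabet
$\mathbb{X} = \{0, 1, ..., D_X - 1\}$. Assume that inequality

$\liminf_{n \to \infty} \mathrm{card}\, U_\delta(n) / n^\beta > 0$ (65)

holds for the set of predictable facts

$U_\delta(n) := \{k \in \mathbb{N} : P(s_k(X_{1:n}) = Z_k) \ge \delta\}$,

where $\delta \in (1/2, 1)$ and $\beta \in (0, 1)$.

Consider the vocabulary size $V[\Gamma(X_{1:n})]$ of a $(|B(\cdot)|, \mathcal{J})$-minimal
grammar transform $\Gamma : \mathbb{X}^+ \to \mathcal{G}$. The accepted
hypotheses entail inequality

$\limsup_{n \to \infty} \mathbf{E} \left( \frac{V[\Gamma(X_{1:n})]}{n^\beta (\log n)^{-1}} \right)^p > 0$, $p > 1$. (66)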



Proof: Code $C = B(\Gamma(\cdot))$ is universal by Theorem 3.9.
Hence by Theorems 4.2, 4.3, and 4.4, viz. the derivation (64),
we have $H^C(n) \ge H(n) \ge H^U(n)$ and

$\lim_{n \to \infty} H^C(n)/n = \lim_{n \to \infty} H(n)/n = \lim_{n \to \infty} H^U(n)/n$ (67)

for the expected code length $H^C(n)$ defined in (10) and the
pseudoentropy $H^U(n)$ defined in (7). As a result, implication

$\liminf_{n \to \infty} \mathrm{card}\, U_\delta(n) / n^\beta > 0 \implies \limsup_{n \to \infty} E^C(n) / n^\beta > 0$ (68)

holds as an instance of (102).

Consider $p, q > 1$ such that $(p - 1)(q - 1) = 1$, i.e., $1/p + 1/q = 1$. Define
variables

$U_n := V[\Gamma(X_{1:2n})] \, n^{-\beta} \log n$, (69)
$T_n := (1 + L(X_{1:2n})) (\log n)^{-1}$. (70)

Theorem 3.11(ii)-(iii) assures that $E^C(n) n^{-\beta} \le W_0 \mathbf{E}\, U_n T_n$
for the constant $W_0$ defined in (32). By Hölder's inequality,
we also have $\mathbf{E}\, U_n T_n \le (\mathbf{E}\, U_n^p)^{1/p} (\mathbf{E}\, T_n^q)^{1/q}$. Since $\mathbf{E}\, T_n^q$ are
bounded by inequality (105), we obtain

$\limsup_{n \to \infty} E^C(n) / n^\beta > 0 \implies \limsup_{n \to \infty} \mathbf{E}\, U_n^p > 0$. (71)

The conclusion follows from the conjunction of propositions
(65), (68), and (71). ■
We have sought to formulate Theorem 5.1 with assumptions as
generic as possible since probabilistic modeling of natural
language is full of unknowns and controversies [68], [60],
[69], [70]. Moreover, several different grammar-based codes
have been considered in the computational linguistic research
[23], [24], [25], [26], [15] and we wanted to cover as many
of them as possible. In particular, our theorem applies to
codes based on flat grammars ($\mathcal{J} = \mathcal{F}$). These codes yield
compressions of texts in natural language which are close
to their decompositions into words in the linguistic sense,
according to the experiment by [26].

We conjecture that there exist many processes that satisfy
the hypothesis of Theorem 5.1. In the next section, some
simple examples of processes will be presented that satisfy
several of the four conditions: (65), stationarity, finite energy,
and finite alphabet. All these conditions are satisfied indeed
for a process which can be obtained from a process of form
dUi by stationary variable-length coding into alphabet {0, 1, 2}. 

Before we proceed to the examples of processes, let us
discuss several ideas for strengthening the proposition of
Theorem 5.1. Striving for realism in language modeling, it is
desirable to relax the assumption of stationarity and to
demonstrate the strong law version of inequality (66), under minor
modifications of other assumptions. Removing the hypothesis
of strict stationarity would also ease construction of processes
to which the modified theorem is applicable.

What conditions suffice for the strong law version of
inequality (66)? The respective strong law proposition reads

$\limsup_{n \to \infty} \frac{V[\Gamma(X_{1:n})]}{n^\beta (\log n)^{-1}} > 0$ a.s. (72)

Let us trace back some plausible conditions for (72).

Consider the code $C = B(\Gamma(\cdot))$ and an arbitrary stationary
process $(X_i)_{i \in \mathbb{Z}}$. According to the Remark after Theorem 3.9
we have

$\limsup_{n \to \infty} |C(X_{1:n})| \log D_Y / n = h_F$ a.s., (73)

where $h_F$ is the entropy rate of the process's random ergodic
measure. On the other hand, the asymptotic equipartition
theorem for nonergodic processes [65, Theorem 13.1], [71]
asserts that

$\lim_{n \to \infty} (-\log P(X_{1:n}))/n = h_F$ a.s. (74)

Recall also Barron's lemma [72, Theorem 3.1], [36, Lemma
1], which states that

$|C(X_{1:n})| \log D_Y + \log P(X_{1:n}) \ge -2 \log n$ (75)

for all but finitely many $n \in \mathbb{N}$ almost surely.
The above three facts imply that function

$G(n) := |C(X_{1:n})| \log D_Y + \log P(X_{1:n}) + 2 \log n$

almost surely satisfies $\limsup_k G(k)/k = 0$ and $G(n) \ge 0$
for all but finitely many $n \in \mathbb{N}$. Hence

$\limsup_{n \to \infty} [2G(n) - G(2n)] \ge 0$ (76)

by the excess-bounding Lemma 1.1.

Assume now that the process $(X_i)_{i \in \mathbb{Z}}$ is a finite-energy
process and satisfies almost surely

liminf 4 log p.^^^p)-!:^ . > 0, (77) 

lim 4 log ™-\ ^0, (78) 
limM^l.tp^-i^.0. (79) 



Hence (72) follows by (76), Theorem 3.11(ii)-(iii), and the
almost sure claim of Theorem 2.2.



For which processes are relations (77)-(79) satisfied? We
suppose that (77)-(79) may hold if the ergodic components
of process $(X_i)_{i \in \mathbb{Z}}$ are sufficiently similar to one another.
Processes distributed according to Bayesian measures of uni- 
formly discretizable statistical models, a class that includes (01 
and was introduced in [73], seem to satisfy this condition. 

May the stationarity assumption be relaxed? There is no
reason to assume that stationary processes are the best models
for texts in natural language. Let us recall that several facts
in information theory have been generalized from the domain
of stationary processes to the larger class of asymptotically
mean stationary (AMS) processes [74], [75]. Conclusion
(72) is likewise true for all AMS finite-energy processes that satisfy
(77)-(79). Indeed, the asymptotic equipartition (74) was proved in
the AMS case explicitly [74, Theorem 8], (73) holds by the
Lemma on page 969 of [74], whereas Barron's lemma (75) and
Theorem 2.2 apply also to a nonstationary process $(X_i)_{i \in \mathbb{Z}}$.
(The right-hand side of (73) and (74) in the AMS case equals
the entropy rate of the random stationary ergodic measure whose
expectation dominates the distribution of $(X_i)_{i \in \mathbb{Z}}$.)

May lim inf be substituted for lim sup in (66) or (72)?
Replacing the upper limit by the lower limit requires developing
an alternative to Lemma 1.1. As discussed in Appendix I, the
claim of Lemma 1.1 cannot be strengthened if its hypothesis
is kept intact.



Does Theorem 5.1 hold for tractably computable codes?
Computing Yang-Kieffer minimal grammars is known to
be NP-hard [6]. We conjecture that computing $(|B(\cdot)|, \mathcal{G})$-minimal
grammars is also NP-hard. Although we sought
to constrain the grammar minimization in Theorem 5.1 to
smaller domains of grammars $\mathcal{J} \subset \mathcal{G}$, it is a question of
future research to decide whether any of these $(|B(\cdot)|, \mathcal{J})$-minimal
codes is tractably computable in a sufficiently good
approximation. This problem is important since experimental
comparisons of vocabulary sizes can be done only for
efficiently computable grammar transforms.

It is natural to ask whether a converse of thesis (I) is true.
Suppose that the minimal grammar-based compression of an
$n$-letter long string applies $n^\beta$ different nonterminals. Does it
mean that the string describes roughly $n^\beta \log n$ independent
facts in a consistent way? We deem that such a conclusion
is not sound even if we accept the most permissive formal
notion of what independent facts are. Counterexamples might
be sought at several levels.
First of all, it is known that the redundancy $H^C(n) - H(n)$
of any universal code cannot be bounded by a process-independent
sublinear function [50]. Hence the difference
$E^C(n) - E(n)$ cannot be universally bounded by $n^\beta$ for any
fixed $\beta \in (0, 1)$, cf. Appendix III.

The most natural reasoning is, however, simpler. Suppose that the vocabulary size grows like (72) only for processes for which inequality (65) is true. The idea of a counterexample to this proposition is obvious. If there is a strongly nonergodic process for which (72) holds then the same holds also for some ergodic component of the process, and this component forms the counterexample by virtue of being ergodic.

A possible line of deprecating this counterexample is to reply that almost every ergodic component of a strongly nonergodic process does describe an infinite sequence of facts which is algorithmically random rather than IID in the probabilistic sense. Thus the appropriate formalization of thesis (I) should rather be an analogue of Theorem 5.1 for individual sequences (i.e., individual texts) within the algorithmic information theory [76], [32], [34], [51]. An analogue of Definition 1.1 should be provided for individual sequences as well.

We conjecture, however, that the converse of the hypothetical analogue of Theorem 5.1 for individual sequences is not true, either. It can be easily shown that the vocabulary size of Yang-Kieffer codes based on irreducible grammars grows as $\Omega(\sqrt{n/\log n})$ on an algorithmically random input of length $n$, cf. Appendix IV. Although the minimal grammar-based codes introduced in this paper seem to compress algorithmically random strings of a fixed length much better, they may asymptotically behave in a similar way.

VI. Examples of strongly nonergodic processes 

In this section we shall examine strongly nonergodic processes in more detail. In particular, we will construct a process that satisfies the hypothesis of Theorem 5.1. That construction will be completed in Subsection VI-D.

According to Definition 1.1, a strongly nonergodic process is such a stochastic process $(X_i)_{i\in\mathbb{Z}}$ that there exist independent equidistributed binary variables $Z_k$, $k = 1, 2, 3, \ldots$, asymptotically predictable in a shift-invariant way given $(X_i)_{i\in\mathbb{Z}}$. That is,

$$\lim_{n\to\infty} P(s_k(X_{t+1:t+n}) = Z_k) = 1 \tag{80}$$

for all lags $t$, all indices $k$, and certain functions $s_k$.

According to Theorem 4.3, a stationary process is strongly nonergodic if and only if its shift-invariant $\sigma$-field contains a nonatomic sub-$\sigma$-field. In consequence, all stationary strongly nonergodic processes are nonergodic. Definition 1.1 and example (4) provide an insight into how linguistically motivated nonergodic processes may look, which seems more concrete than previous discussions of nonergodicity in language modeling [31], [77]. Nonetheless, nonergodic processes received much attention in information theory at a very fundamental level [78], [79], [80], [81], mostly from the viewpoint of the general ergodic decomposition theorem [82], [66], [83, Theorem 1.4.10], [53, Theorem 9.10-12].

Moreover, strongly nonergodic processes are not the first specific subclass of nonergodic processes to be introduced. Conditionally IID processes are another such subclass, which has long been studied [84], [85], [86], [87], [88], [89], [90]. A stationary process is called conditionally IID if its random ergodic measure is supported on the set of IID process measures. The historical predecessor of the ergodic decomposition theorem, known as the de Finetti theorem, states that each exchangeable process is conditionally IID [89].

There exist strongly nonergodic processes which are condi- 
tionally IID. Several of our examples will be of such a form. 
The conditionally IID processes do not seem, however, to 
provide reasonable models for natural language. Let us define 
a more plausible subclass of strongly nonergodic processes: 

Definition 6.1: A strongly nonergodic process $(X_i)_{i\in\mathbb{Z}}$ is called an accessible description process if the variables $Z_k$, $k \in \mathbb{N}$, are conditionally independent given any finite block $X_{n:m}$, $n \le m$.

This condition is satisfied for example (4). Using the accessibility condition allows one, in particular, to compute an asymptotic expression for the $n$-symbol excess entropies $E(n)$ if $(K_i)_{i\in\mathbb{Z}}$ is IID and $P(K_i = k) \propto k^{-1/\beta}$.

In a linguistic interpretation, accessible description pro- 
cesses correspond to collections of texts that avoid describing 
facts in a cryptic or hermetic way. When a fact can be inferred 
from such a text, any knowledge of other facts does not 
improve this inference. The text is self-contained and there 
can be only little ambiguity of the knowledge conveyed by 
the text. 

Compare it with a cryptic way of writing. As in alchemical treatises, the writer can adopt some specific expressions that replace the words known to the reader. As a result, the knowledge conveyed by the text is inaccessible to the reader unless the reader knows the random key to the text (if the key exists, of course; cf. the famous cases of the Voynich manuscript or Codex Seraphinianus [91], [92]).

Let us state explicitly that the conditional independence is assumed in Definition 6.1 only for the variables mentioned in Definition 1.1. For some strongly nonergodic processes, there may be more independent variables $Z_k$ having property (80) than $(Z_k)_{k\in\mathbb{N}}$ but they need not be conditionally independent given blocks $X_{n:m}$. Cryptic and explicit ways of referring to the described world may also coexist in natural language. Moreover, each strongly nonergodic process is an accessible description process in an asymptotic sense, i.e., variables $Z_k$ are conditionally independent given the whole process $(X_i)_{i\in\mathbb{Z}}$.

A. A mixture of Bernoulli processes 

Presenting a strongly nonergodic process over a finite alphabet is easy if we do not require the power-law growth of the $n$-symbol excess entropies $E(n)$. We can consider a conditionally IID process which is an uncountable mixture of IID processes. Let $(X^{(p)}_i)_{i\in\mathbb{Z}}$ denote a Bernoulli process with parameter $p$, $0 \le p \le 1$, namely, a binary IID process having the marginal distribution $P(X^{(p)}_i = 1) = p$, $P(X^{(p)}_i = 0) = 1 - p$.
Now we can construct a process $(X_i)_{i\in\mathbb{Z}}$ such that

$$P(X_{1:n} = x_{1:n} \mid Y = p) = P(X^{(p)}_{1:n} = x_{1:n}) \tag{81}$$

for a real variable $Y$ supported on $[0, 1]$ with continuous distribution $P(Y \le p) = \Phi(p) := \int_0^p f(x)\,dx$.

By Theorem 4.3, $(X_i)_{i\in\mathbb{Z}}$ is an uncountable description process since $Y$ equals $\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} X_i$ almost surely and thus is measurable against the completed shift-invariant $\sigma$-field. The suitable $Z_k$ satisfying (80) may be constructed as the consecutive digits of the binary expansion of the distribution function $\Phi$ of $Y$ taken at the value of $Y$, i.e., $\Phi(Y) = \sum_{k=1}^{\infty} Z_k 2^{-k}$. If $Y$ has the beta distribution with nonnegative rational parameters, $(X_i)_{i\in\mathbb{Z}}$ is the well-known Polya urn process [86], [87], [90]. On the other hand, we have $Y = \Phi(Y) = \sum_{k=1}^{\infty} Z_k 2^{-k}$ provided $Y$ is uniformly distributed on $[0, 1]$.

Regardless of the distribution of $Y$, the pace of inferring the hidden values of $Z_k$ does not come close to a power law, cf. [73]. Let us observe that blocks $X_{1:n}$ are conditionally independent from futures $(X_i)_{i>n}$ given the block sums $S_n := \sum_{i=1}^{n} X_i$. Hence in view of the information inequality for the Markov chain $X_{1:n} \to S_n \to (X_i)_{i>n} \to (Z_k)_{k\in\mathbb{N}}$ we have

$$I(X_{1:n}; (Z_k)_{k\in\mathbb{N}}) = I(S_n; (Z_k)_{k\in\mathbb{N}}) \le H(S_n),$$
$$E(n) = I(X_{1:n}; X_{n+1:2n}) = I(S_n; X_{n+1:2n}) \le H(S_n).$$

Moreover, $H(S_n) \le \log(n+1)$ since $S_n$ takes only $n + 1$ distinct values. The equality arises for the uniformly distributed $Y$ since then

$$P(S_n = s) = \binom{n}{s}\int_0^1 p^s (1-p)^{n-s}\,dp = \frac{1}{n+1}$$

for $s = 0, 1, \ldots, n$.
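The equality case can be checked with exact rational arithmetic. The following is a minimal sketch, assuming the uniform density $f \equiv 1$ on $[0, 1]$ and using the beta integral $\int_0^1 p^s(1-p)^{n-s}\,dp = s!\,(n-s)!/(n+1)!$; the function names are illustrative only.

```python
from fractions import Fraction
from math import comb, factorial, log

def block_sum_pmf(n):
    """Exact distribution of S_n when Y is uniform on [0, 1] (assumption of this sketch)."""
    pmf = []
    for s in range(n + 1):
        # Beta integral: int_0^1 p^s (1-p)^(n-s) dp = s! (n-s)! / (n+1)!
        beta_integral = Fraction(factorial(s) * factorial(n - s), factorial(n + 1))
        pmf.append(comb(n, s) * beta_integral)
    return pmf

def entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(float(p) * log(float(p)) for p in pmf if p > 0)

n = 20
pmf = block_sum_pmf(n)
print(all(p == Fraction(1, n + 1) for p in pmf))  # every block sum is equally likely
print(entropy(pmf), log(n + 1))                   # H(S_n) against the bound log(n + 1)
```

Since every value of $S_n$ has probability $1/(n+1)$, the entropy $H(S_n)$ attains the bound $\log(n+1)$, so the excess entropy of this mixture can grow at most logarithmically.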

A similar behavior may be observed for conditionally IID processes over any other finite alphabet $\mathbb{X} = \{0, 1, \ldots, D - 1\}$. The role of $S_n$ is played by the block type, namely, the tuple of random variables $S_n = (S_n^0, S_n^1, \ldots, S_n^{D-1})$, where $S_n^i$ is the number of occurrences of value $i$ in the block $X_{1:n}$. By the Markov chain $X_{1:n} \to S_n \to (X_i)_{i>n}$, we have $E(n) = I(X_{1:n}; X_{n+1:2n}) \le H(S_n) \le D\log(n+1)$. This bound can be related to the expression for the minimax regret in exponential families [93, Theorem 7.2], [94].

B. Processes over an infinite alphabet 

The previous example looked like estimating an unknown real-valued parameter $p$ of the Bernoulli process in the ordinary setting of Bayesian statistics. When the alphabet $\mathbb{X}$ is infinite, a different type of stronger dependence can arise in conditionally independent strongly nonergodic processes. Although we can always clump variables $Z_k$ together into a single uniformly distributed real variable $Y = \sum_{k=1}^{\infty} Z_k 2^{-k}$, it can sometimes be quite unnatural to think that there is any specific number of real parameters to be estimated.

For instance, let the alphabet be $\mathbb{X} = \mathbb{N} \times \{0, 1\}$ and let $(X_i)_{i\in\mathbb{Z}}$ take the form

$$X_i := (K_i, Z_{K_i}) \tag{82}$$

just as in the initial example (4). This process is conditionally IID if $(K_i)_{i\in\mathbb{Z}}$ is IID and $(Z_k)_{k\in\mathbb{N}}$ and $(K_i)_{i\in\mathbb{Z}}$ are independent.

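Additionally, (82) constitutes a strongly nonergodic process if $P(K_i = k) > 0$ for all $k \in \mathbb{N}$. Let us write $u \sqsubseteq w$ when a sequence or a string $w$ contains a string $u$ as a substring. For $\mathbb{X} = \mathbb{N} \times \{0, 1\}$ and $w \in \mathbb{X}^{\mathbb{Z}} \cup \mathbb{X}^*$, we may define the predictors as

$$s_k(w) := \begin{cases} 0 & \text{if } (k, 0) \sqsubseteq w \text{ and } (k, 1) \not\sqsubseteq w, \\ 1 & \text{if } (k, 1) \sqsubseteq w \text{ and } (k, 0) \not\sqsubseteq w, \\ 2 & \text{else.} \end{cases} \tag{83}$$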
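The predictors (83) are easy to try out on a simulated block. The following is a minimal sketch, assuming a geometric law for $K_i$ (a hypothetical choice that merely guarantees $P(K_i = k) > 0$ for all $k$) and fair bits $Z_k$ drawn lazily; all function names are illustrative.

```python
import random

def sample_geometric(rng):
    """Hypothetical law for K_i: geometric on {1, 2, ...}, so P(K_i = k) > 0 for all k."""
    return 1 + int(rng.expovariate(0.7))

def sample_process(n, z_bits, rng):
    """Draw X_1..X_n with X_i = (K_i, Z_{K_i}); fair bits Z_k are drawn on first use."""
    xs = []
    for _ in range(n):
        k = sample_geometric(rng)
        if k not in z_bits:
            z_bits[k] = rng.randrange(2)
        xs.append((k, z_bits[k]))
    return xs

def predictor(k, block):
    """Guessing function s_k applied to a finite block of statements."""
    has0 = (k, 0) in block
    has1 = (k, 1) in block
    if has0 and not has1:
        return 0
    if has1 and not has0:
        return 1
    return 2  # fact k is not mentioned, or is mentioned inconsistently

rng = random.Random(0)
z_bits = {}
xs = sample_process(2000, z_bits, rng)
seen = sorted({k for k, _ in xs})
print(all(predictor(k, xs) == z_bits[k] for k in seen))
```

In this sketch a fact is predicted correctly as soon as it is mentioned once, and the predictor returns 2 for every index that has not appeared in the block yet.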

A particularly interesting case arises when variables $K_i$ are zeta distributed, namely,

$$P(K_i = k) = k^{-1/\beta} / \zeta(\beta^{-1}), \tag{84}$$

where $\beta \in (0, 1)$ and $\zeta(x) = \sum_{k=1}^{\infty} k^{-x}$ is the zeta function. In this case, the $n$-symbol excess entropy $E(n)$ grows as a power law. To deduce the latter proposition, let $U_\delta(n)$ be the set of well predictable facts, defined in Section I. In view of equality

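$$P(s_k(X_{1:n}) = Z_k) = P(K_i = k \text{ for some } i \in \{1, \ldots, n\}) = 1 - [1 - P(K_i = k)]^n,$$

we have $k \in U_\delta(n)$ if and only if $P(K_i = k) \ge 1 - (1 - \delta)^{1/n}$. This yields

$$U_\delta(n) \supset \{k \in \mathbb{N} : P(K_i = k) \ge -n^{-1}\log(1 - \delta)\}$$

by inequality $1 - x^{1/n} \le -n^{-1}\log x$ for $x > 0$ and hence

$$\operatorname{card} U_\delta(n) \ge \left\lfloor \left[\frac{n}{-\zeta(\beta^{-1})\log(1 - \delta)}\right]^{\beta} \right\rfloor.$$

Thus the power-law growth of $E(n)$ follows by (9). A more precise calculation of $E(n)$ is presented in the next subsection.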
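The lower bound on the number of well predictable facts can be compared with the exact count. The following is a minimal sketch, assuming the zeta law (84), the membership criterion $1 - [1 - P(K_i = k)]^n \ge \delta$, and the bound $\lfloor [n / (-\zeta(\beta^{-1})\log(1-\delta))]^\beta \rfloor$; the helper `zeta` is a hand-written approximation (partial sum plus an Euler-Maclaurin tail), not a library routine.

```python
from math import floor, log

def zeta(x, terms=10_000):
    """Riemann zeta for x > 1: partial sum plus an Euler-Maclaurin tail correction."""
    head = sum(k ** -x for k in range(1, terms))
    tail = terms ** (1 - x) / (x - 1) + 0.5 * terms ** -x + x * terms ** (-x - 1) / 12
    return head + tail

def card_U(n, beta, delta):
    """Exact size of U_delta(n): P(K_i = k) decreases in k, so U_delta(n) = {1, ..., k_max}."""
    z = zeta(1 / beta)
    threshold = 1 - (1 - delta) ** (1 / n)
    k = 0
    while (k + 1) ** (-1 / beta) / z >= threshold:
        k += 1
    return k

def lower_bound(n, beta, delta):
    """Floor of [n / (-zeta(1/beta) log(1 - delta))]^beta."""
    z = zeta(1 / beta)
    return floor((n / (-z * log(1 - delta))) ** beta)

for n in (10**2, 10**4, 10**6):
    print(n, card_U(n, 0.5, 0.9), lower_bound(n, 0.5, 0.9))
```

For $\beta = 1/2$ both columns grow like $\sqrt{n}$, which is the power-law behavior the argument relies on.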

As indicated in Section I, example (82) may be generalized. Keeping intact the alphabet $\mathbb{X} = \mathbb{N} \times \{0, 1\}$, the independence $(Z_k)_{k\in\mathbb{N}} \perp\!\!\!\perp (K_i)_{i\in\mathbb{Z}}$, and guessing functions (83), one may admit $(K_i)_{i\in\mathbb{Z}}$ to be any ergodic stationary process assuming values in natural numbers so that $P(K_i = k) > 0$ for all $k \in \mathbb{N}$. It is easy to prove that such a process $(X_i)_{i\in\mathbb{Z}}$ is strongly nonergodic. The proof applies the ergodic theorem [83, Theorem 1.3.1] and the fact that almost sure convergence implies convergence in probability [95, page 188].

The logical interpretation of this process was also mentioned in Section I. Namely, $(X_i)_{i\in\mathbb{Z}}$ can be imagined as a sequence of consecutive logical statements extracted from a random collection of texts. Each statement of form $X_i = (k, z)$ explicitly asserts that $z$ is the value of the $k$-th bit of some abstract state of affairs $(Z_k)_{k\in\mathbb{N}}$. Regardless of the choice of described facts, determined by process $(K_i)_{i\in\mathbb{Z}}$, the statements are always logically consistent. Namely, if two propositions $X_i = (k, z)$ and $X_j = (k', z')$ describe the same bits ($k = k'$) then they always report the same value ($z = z'$).

Notwithstanding this, strict logical consistency of statements $X_i$ is not needed to yield a strongly nonergodic process. For example, let us extend the probability space with an ergodic process $(U_i)_{i\in\mathbb{Z}}$ such that $P(U_i = 1) > 1/2$ and $(U_i)_{i\in\mathbb{Z}} \perp\!\!\!\perp (Z_k)_{k\in\mathbb{N}}, (K_i)_{i\in\mathbb{Z}}$. For the processes $(Z_k)_{k\in\mathbb{N}}$ and $(K_i)_{i\in\mathbb{Z}}$ as previously, we set



(85) 



where $\oplus$ is addition modulo 2 (XOR). Such variables $X_i = (k, z)$ can be interpreted as statements which describe $(Z_k)_{k\in\mathbb{N}}$ but are not necessarily true. The noise in the statements can be filtered out by using guessing functions

$$s_k(x_{1:n}) := \begin{cases} 0 & \text{if } N_{k0}(x_{1:n}) > N_{k1}(x_{1:n}), \\ 1 & \text{if } N_{k1}(x_{1:n}) > N_{k0}(x_{1:n}), \\ 2 & \text{else,} \end{cases} \tag{86}$$

where $N_{kz}(x_{1:n})$ is the number of $i \in \{1, \ldots, n\}$ such that $x_i = (k, z)$. Hence it can be shown easily that the process of (85) is also strongly nonergodic.
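The majority-vote guessing functions (86) can be illustrated on simulated noisy statements. The following is a minimal sketch under assumptions that are not taken from the text: each statement reports the true bit with a fixed probability `truth_prob` greater than 1/2, independently of everything else, and $K_i$ is uniform on a small finite range purely for the sake of a quick experiment.

```python
import random
from collections import Counter

def noisy_statements(n, z_bits, truth_prob, rng):
    """Statements (k, z) whose bit z equals Z_k with probability truth_prob (assumed noise model)."""
    facts = sorted(z_bits)
    xs = []
    for _ in range(n):
        k = rng.choice(facts)                 # hypothetical: K_i uniform on a finite set
        correct = rng.random() < truth_prob
        xs.append((k, z_bits[k] if correct else 1 - z_bits[k]))
    return xs

def majority_predictor(k, block):
    """Guessing function of (86): compare the counts N_k0 and N_k1 in the block."""
    counts = Counter(block)
    n0, n1 = counts[(k, 0)], counts[(k, 1)]
    if n0 > n1:
        return 0
    if n1 > n0:
        return 1
    return 2  # tie, including the case where fact k is never mentioned

rng = random.Random(1)
facts = {k: rng.randrange(2) for k in range(1, 6)}
xs = noisy_statements(5000, facts, 0.8, rng)
print([majority_predictor(k, xs) == facts[k] for k in sorted(facts)])
```

With about a thousand statements per fact, the majority vote recovers every bit in this run; as the block grows, the probability of a wrong vote decays to zero, which is the asymptotic predictability required of a strongly nonergodic process.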

C. Accessible description processes 

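Both (82) and (85) form accessible description processes. This observation opens an easy path to compute $n$-symbol excess entropies. Assume that $(X_i)_{i\in\mathbb{Z}}$ is an accessible description process and the block entropy $H(n)$ is finite. By the chain rules,

$$E(n) = I(X_{1:n}; X_{n+1:2n}; (Z_k)_{k\in\mathbb{N}}) + I(X_{1:n}; X_{n+1:2n} \mid (Z_k)_{k\in\mathbb{N}}).$$

The conditional mutual information is nonnegative, whereas the triple mutual information satisfies

$$I(X_{1:n}; X_{n+1:2n}; (Z_k)_{k\in\mathbb{N}}) = 2 I(X_{1:n}; (Z_k)_{k\in\mathbb{N}}) - I(X_{1:2n}; (Z_k)_{k\in\mathbb{N}})$$
$$= \sum_{k=1}^{\infty} \left[2 I(X_{1:n}; Z_k) - I(X_{1:2n}; Z_k)\right]$$
$$= \sum_{k=1}^{\infty} I(X_{1:n}; X_{n+1:2n}; Z_k) \tag{87}$$

for conditionally independent variables $Z_k$ since

$$I(X_{1:n}; (Z_k)_{k\in\mathbb{N}}) = \sum_{k=1}^{\infty} I(X_{1:n}; Z_k \mid Z_{1:k-1})$$
$$= \sum_{k=1}^{\infty} \left[H(Z_k \mid Z_{1:k-1}) - H(Z_k \mid X_{1:n}, Z_{1:k-1})\right]$$
$$= \sum_{k=1}^{\infty} \left[H(Z_k) - H(Z_k \mid X_{1:n})\right] = \sum_{k=1}^{\infty} I(X_{1:n}; Z_k). \tag{88}$$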

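Theorem 6.2: Let $(K_i)_{i\in\mathbb{Z}}$ be IID variables with the distribution (84), $\beta \in (0, 1)$. The $n$-symbol excess entropies of the process $(X_i)_{i\in\mathbb{Z}}$ given by (82) obey the asymptotic law

$$\lim_{n\to\infty} \frac{E(n)}{n^\beta} = \frac{(2 - 2^\beta)\,\Gamma(1 - \beta)\,\log 2}{[\zeta(\beta^{-1})]^\beta}.$$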

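Proof: Since $I(X_{1:n}; X_{n+1:2n} \mid (Z_k)_{k\in\mathbb{N}})$ vanishes, we may write $E(n) = \sum_{k=1}^{\infty} I(X_{1:n}; X_{n+1:2n}; Z_k)$ by formula (87). Noticing that

$$H(Z_k \mid X_{1:n}) = (\log 2) \cdot P(K_i \ne k \text{ for all } i \in \{1, \ldots, n\}) + 0 \cdot P(K_i = k \text{ for some } i \in \{1, \ldots, n\}),$$
$$I(X_{1:n}; Z_k) = (\log 2) \cdot P(K_i = k \text{ for some } i \in \{1, \ldots, n\}) = (1 - [1 - P(K_i = k)]^n)\log 2,$$

we obtain the triple mutual information

$$I(X_{1:n}; X_{n+1:2n}; Z_k) = (1 - [1 - P(K_i = k)]^n)^2 \log 2.$$

Hence $E(n)$ equals, up to a small constant, the integral

$$(\log 2)\int_1^{\infty} \left(1 - \left(1 - \frac{A}{k^{1/\beta}}\right)^{n}\right)^{2} dk = \beta (An)^\beta (\log 2) \int_{(1-A)^n}^{1} (1 - u)^2 f_n(u)\,du,$$

where $A := 1/\zeta(\beta^{-1})$ and

$$f_n(u) := u^{1/n - 1}\left[n(1 - u^{1/n})\right]^{-(\beta+1)}.$$

By the de l'Hôpital rule,

$$\lim_{n\to\infty} f_n(u) = f(u) := u^{-1}(-\log u)^{-(\beta+1)}.$$

The product $(1 - u)^2 f(u)$ has a pole only at 0, where it integrates like $(-\log u)^{-(\beta+1)}\,d(-\log u)$. Hence, $(1 - u)^2 f(u)$ is integrable and

$$\lim_{n\to\infty} \frac{E(n)}{n^\beta} = \frac{\beta \log 2}{[\zeta(\beta^{-1})]^\beta} \int_0^1 \frac{(1 - u)^2\,du}{u(-\log u)^{\beta+1}}$$

by the dominated convergence theorem. The further substitution $t = -\log u$ yields

$$\int_0^1 \frac{(1 - u)^2\,du}{u(-\log u)^{\beta+1}} = \int_0^{\infty} (1 - e^{-t})^2\, t^{-\beta-1}\,dt = (2 - 2^\beta)\beta^{-1}\Gamma(1 - \beta),$$

where integrals

$$\int_0^{\infty} (e^{-kt} - 1)\,t^{-\beta-1}\,dt = -\int_0^{\infty} (-k e^{-kt})(-\beta^{-1})\,t^{-\beta}\,dt = -k^\beta \beta^{-1}\Gamma(1 - \beta)$$

can be safely integrated by parts for the considered $\beta$.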
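The asymptotic law can be checked numerically by summing the series for $E(n)$ directly. The following is a minimal sketch, assuming $E(n) = \sum_k (1 - [1 - p_k]^n)^2 \log 2$ with $p_k = k^{-1/\beta}/\zeta(\beta^{-1})$ and the limit constant $(2 - 2^\beta)\Gamma(1-\beta)\log 2 / [\zeta(\beta^{-1})]^\beta$ as obtained in the proof; the truncation point of the series and the helper `zeta` are choices of this sketch.

```python
from math import exp, gamma, log, log1p

def zeta(x, terms=10_000):
    """Riemann zeta for x > 1: partial sum plus an Euler-Maclaurin tail correction."""
    head = sum(k ** -x for k in range(1, terms))
    tail = terms ** (1 - x) / (x - 1) + 0.5 * terms ** -x + x * terms ** (-x - 1) / 12
    return head + tail

def excess_entropy(n, beta):
    """E(n) in nats, computed from the per-fact triple mutual informations."""
    a = 1 / zeta(1 / beta)
    k_max = int(200 * (a * n) ** beta) + 1   # beyond this point the terms are negligible
    total = 0.0
    for k in range(1, k_max + 1):
        p = a * k ** (-1 / beta)
        total += (1 - exp(n * log1p(-p))) ** 2   # (1 - (1 - p)^n)^2, computed stably
    return total * log(2)

def limit_constant(beta):
    """Limit of E(n) / n^beta according to Theorem 6.2."""
    return (2 - 2 ** beta) * gamma(1 - beta) * log(2) / zeta(1 / beta) ** beta

beta = 0.5
for n in (10**2, 10**4, 10**6):
    print(n, excess_entropy(n, beta) / n ** beta, limit_constant(beta))
```

The ratio $E(n)/n^\beta$ approaches the constant from below, with a gap of order $n^{-\beta}$ that reflects the "small constant" separating the sum from the integral.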



D. Coding into a finite alphabet 

In this section we will construct a process that satisfies the assumptions of Theorem 5.1. The process will be denoted as $(Y_i)_{i\in\mathbb{Z}}$ and will be given as a stationary variable length coding of the process (82)-(84). The desiderata for the process $(Y_i)_{i\in\mathbb{Z}}$ are as follows:

(a) $(Y_i)_{i\in\mathbb{Z}}$ is a process over a finite alphabet $\mathbb{Y} = \{0, 1, \ldots, D - 1\}$,

(b) $(Y_i)_{i\in\mathbb{Z}}$ is stationary,

(c) $(Y_i)_{i\in\mathbb{Z}}$ has finite energy, and

(d) there exist independent equidistributed binary random variables $(Z_k)_{k\in\mathbb{N}}$, $P(Z_k = z) = 1/2$, $z \in \{0, 1\}$, measurable against the shift-invariant $\sigma$-field of $(Y_i)_{i\in\mathbb{Z}}$ such that

$$\liminf_{n\to\infty} n^{-\beta}\,|U_\delta(n)| > 0 \tag{90}$$

holds for a certain $\beta \in (0, 1)$, all $\delta \in (1/2, 1)$, and the sets $U_\delta(n) := \{k \in \mathbb{N} : P(s_k(Y_{1:n}) = Z_k) \ge \delta\}$ of well-predictable $Z_k$'s, where functions $s_k$ satisfy

$$\lim_{n\to\infty} P(s_k(Y_{t+1:t+n}) = Z_k) = 1, \quad \forall t \in \mathbb{Z}. \tag{91}$$

Properties (b)-(d), but not (a), are satisfied by the process (82)-(84). We have supposed that a suitable distribution over a finite alphabet can be constructed as the stationary mean of a certain encoding of the process (4). To explain what this means, we must introduce two more concepts.

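Firstly, consider a function $f : \mathbb{X} \to \mathbb{Y}^*$ that maps single symbols into strings. We extend it to $f^* : \mathbb{X}^* \to \mathbb{Y}^*$, $f^{\mathbb{N}} : \mathbb{X}^{\mathbb{N}} \to \mathbb{Y}^{\mathbb{N}} \cup \mathbb{Y}^*$, and $f^{\mathbb{Z}} : \mathbb{X}^{\mathbb{Z}} \to \mathbb{Y}^{\mathbb{Z}} \cup (\mathbb{Y}^* \times \mathbb{Y}^*)$, defined as

$$f^*(x^n) := f(x_1) f(x_2) \ldots f(x_n), \tag{92}$$
$$f^{\mathbb{N}}(x^{\mathbb{N}}) := f(x_1) f(x_2) f(x_3) \ldots, \tag{93}$$
$$f^{\mathbb{Z}}(x^{\mathbb{Z}}) := \ldots f(x_{-1}) f(x_0) \boldsymbol{.} f(x_1) f(x_2) \ldots, \tag{94}$$

where $x_i \in \mathbb{X}$. (The bold-face dot separates the 0-th and the first symbol.)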

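Next, having denoted the shift operation as $T(x^{\mathbb{Z}}) := \ldots x_0 x_1 \boldsymbol{.} x_2 x_3 \ldots$ for $x^{\mathbb{Z}} = \ldots x_{-1} x_0 \boldsymbol{.} x_1 x_2 \ldots$, a measure $\mu$ on $(\mathbb{X}^{\mathbb{Z}}, \mathcal{X}^{\mathbb{Z}})$ is called asymptotically mean stationary (AMS) if the limits

$$\bar\mu(A) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu \circ T^{-i}(A) \tag{95}$$

exist for all $A \in \mathcal{X}^{\mathbb{Z}}$, cf. [74], [96]. The limit $\bar\mu$, if it exists as a total function $\mathcal{X}^{\mathbb{Z}} \to \mathbb{R}$, forms a stationary measure on $(\mathbb{X}^{\mathbb{Z}}, \mathcal{X}^{\mathbb{Z}})$, i.e., $\bar\mu \circ T^{-1} = \bar\mu$, and is called the stationary mean of $\mu$. Every stationary measure is AMS [74]. Moreover, for an AMS measure $\mu$, the measure $\mu \circ (f^{\mathbb{Z}})^{-1}$ is AMS under mild conditions, cf. [74, Example 6], [97].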

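The following proposition has been proved in [97]:

Theorem 6.3: Let $\mu = P((X_i)_{i\in\mathbb{Z}} \in \cdot)$ be the distribution of the process (82)-(84) and put $\mathbb{Y} = \{0, 1, 2\}$. Consider a coding function $f : \mathbb{X} \to \mathbb{Y}^+$ given as

$$f(k, z) = b(k)\,z\,2, \tag{96}$$

where $b(k) \in \{0, 1\}^+$ is the binary representation of a natural number $k$. The process $(Y_i)_{i\in\mathbb{Z}}$, distributed according to the stationary mean $P'((Y_i)_{i\in\mathbb{Z}} \in \cdot) = \overline{\mu \circ (f^{\mathbb{Z}})^{-1}}$, satisfies conditions (a)-(d) for $\zeta(\beta^{-1}) > 4$. Variables $Z_k$ may be constructed as $Z_k = s_k((Y_i)_{i\in\mathbb{Z}})$, where

$$s_k(w) := \begin{cases} 0 & \text{if } 2b(k)02 \sqsubseteq w \text{ and } 2b(k)12 \not\sqsubseteq w, \\ 1 & \text{if } 2b(k)12 \sqsubseteq w \text{ and } 2b(k)02 \not\sqsubseteq w, \\ 2 & \text{else} \end{cases} \tag{97}$$

for $w \in \mathbb{Y}^{\mathbb{Z}} \cup \mathbb{Y}^*$.

Inequality $\zeta(\beta^{-1}) > 4$ holds for $\beta > 0.7728\ldots$ and comes from satisfying condition (c). Mind that processes $(Y_i)_{i\in\mathbb{Z}}$ and $(X_i)_{i\in\mathbb{Z}}$ live on different probability spaces, say $(\Omega', \mathcal{J}', P')$ and $(\Omega, \mathcal{J}, P)$ respectively.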
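The coding function (96) and the guessing functions (97) can be tried on a short block of statements. The following is a minimal sketch, assuming that $b(k)$ is the standard binary expansion of $k$ without leading zeros and that strings over $\{0, 1, 2\}$ are represented as Python strings; it covers only the finite-block extension $f^*$, not the stationary mean.

```python
def encode_symbol(k, z):
    """Coding function f(k, z) = b(k) z 2; b(k) is taken to be the standard binary expansion."""
    return format(k, "b") + str(z) + "2"

def encode(xs):
    """Extension f^*: concatenate the codewords of a finite block of statements."""
    return "".join(encode_symbol(k, z) for k, z in xs)

def predictor(k, w):
    """Guessing function s_k of (97) on a string w over the alphabet {0, 1, 2}."""
    has0 = "2" + format(k, "b") + "02" in w
    has1 = "2" + format(k, "b") + "12" in w
    if has0 and not has1:
        return 0
    if has1 and not has0:
        return 1
    return 2

xs = [(1, 0), (5, 1), (2, 0), (5, 1), (1, 0)]
w = encode(xs)
print(w)
print([predictor(k, w) for k in (1, 2, 5, 7)])
```

The symbol 2 acts as a delimiter, so a pattern $2\,b(k)\,z\,2$ can only match a complete codeword. A codeword at the very beginning of a finite string is not preceded by a delimiter and is therefore missed by the predictor, which does not matter asymptotically.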

VII. The outlook 

This paper provides an explanation for the distribution of 
words in natural language as a joint effect of the narrative 
repetitions in texts and the randomness of the described world. 
Besides developing a class of less redundant grammar-based 
codes, an important development of this work is a formal 
model of the repetitive knowledge, as it can be conveyed by 
texts in natural language. 



We have brought together several research lines in linguistics and information theory, using very different concepts and terms. Thus it is worth summarizing our main result in plain words at the expense of certain simplification. There seems to be little consolation that linguists can find in observing Zipf's law, yet there is nothing scary in it either that could shake their preconceptions about human communication. The Martian scientist, speculated about by G. K. Zipf in the passage quoted in the introduction, may not infer that human texts convey any timeless knowledge by counting the letter chunks. Nonetheless, a converse implication holds.

A sufficient explanation for Zipf's law can be provided by 
the notion that human utterances convey some general knowl- 
edge that is mostly logically consistent, a priori unknown but 
learnable, and repetitive but potentially infinite. Moreover, the 
number of distinct letter chunks obtained by minimal universal 
grammar-based coding, provides an upper bound on the total 
amount of repetitive knowledge expressed in the text. It does 
not matter whether a part of this knowledge is specific, or 
general, or objective, or subjective, or discovered, or created. 

Hence we find interesting the large scale experiments where a short tail of the distribution of words and set phrases can be observed [29]. These experiments indicate a limitation of the active vocabulary of a single person, as opposed to the vocabulary of a language. If a similar exponential tail were observed for the nonterminals of the minimal grammar-based code, it would imply a limitation on an individual person's memory. Observing a comparably small number of distinct nonterminals in the grammar-based compression of the Voynich manuscript could also corroborate the hoax hypothesis [91], [92].

It is harder to foresee what kind of mathematical research may be inspired by this paper, which adopts a singular bird's-eye view on information theory and related domains. Several open problems have been introduced in Section V. Although we kept our argumentation within the scope of Shannon information theory, certain prospective problems for algorithmic information theory have been mentioned as well.
Appendix I
The excess-bounding lemma

This paper deals with bounds on sublinear parts of functions which grow asymptotically at the same linear rate. Thus the following lemma, observed in [14] in a less general form, constitutes a convenient tool:

Lemma 1.1 (Excess-bounding lemma): Consider a function $G : \mathbb{N} \to \mathbb{R}$ such that $G(n) \ge 0$ for all but finitely many $n$ and $\lim_k G(k)/k = 0$. For any $A \in \mathbb{N}$, we have

$$\limsup_{n\to\infty}\,[A\,G(n) - G(An)] \ge 0. \tag{98}$$

Proof: For $n, m \in \mathbb{N}$, we have the identity

$$\sum_{k=0}^{m-1} \frac{A\,G(A^k n) - G(A^{k+1} n)}{A^{k+1}} = G(n) - \frac{G(A^m n)}{A^m}.$$

Hence $G(n) - n \lim_k G(k)/k \ge 0$ implies $\sum_{k=0}^{\infty} [A\,G(A^k n) - G(A^{k+1} n)]/A^{k+1} \ge 0$. Putting $n = A^p M$, we obtain

$$\sum_{k=p}^{\infty} \frac{A\,G(A^k M) - G(A^{k+1} M)}{A^{k+1}} \ge 0 \tag{99}$$

for any $M \in \mathbb{N}$ and all but finitely many $p \in \mathbb{N}$. If (98) did not hold then we would have $A\,G(A^k M) - G(A^{k+1} M) < 0$ for all $k$ greater than some $p$. This contradicts, however, (99).

■
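The telescoping identity behind the lemma, $\sum_{k=0}^{m-1} [A\,G(A^k n) - G(A^{k+1} n)]/A^{k+1} = G(n) - G(A^m n)/A^m$, is easy to confirm numerically. The following is a minimal sketch, assuming the illustrative choice $G(n) = \sqrt{n}$, which is nonnegative and sublinear; the function names are hypothetical.

```python
from math import sqrt

def excess(G, A, n):
    """The quantity A*G(n) - G(A*n) whose upper limit is bounded in (98)."""
    return A * G(n) - G(A * n)

def telescoped_sum(G, A, n, m):
    """Sum over k = 0..m-1 of [A G(A^k n) - G(A^(k+1) n)] / A^(k+1)."""
    return sum(excess(G, A, A ** k * n) / A ** (k + 1) for k in range(m))

G = sqrt           # illustrative choice: G(n) >= 0 and G(k)/k -> 0
A, n, m = 3, 5, 12
lhs = telescoped_sum(G, A, n, m)
rhs = G(n) - G(A ** m * n) / A ** m
print(lhs, rhs)
```

As $m$ grows, the subtracted term $G(A^m n)/A^m$ vanishes for a sublinear $G$, so the weighted sum of excesses converges to $G(n) \ge 0$ and the excesses cannot all be negative from some point on.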

As a special case, consider functions $G_2(n) \ge G_1(n) \ge 0$ and their excess values $F_i(n) = 2G_i(n) - G_i(2n)$. If the functions have equal limits $\lim_n G_i(n)/n = g < \infty$ then

$$\limsup_{n\to\infty}\,[F_2(n) - F_1(n)] \ge 0 \tag{100}$$

follows from Lemma 1.1 applied to $G = G_2 - G_1$. By inequality $\limsup_n (a_n + b_n) \ge \limsup_n a_n + \liminf_n b_n$ for arbitrary two sequences $(a_n)$ and $(b_n)$, inequality (100) entails the implication

$$\liminf_{n\to\infty} \frac{F_1(n)}{n^\beta} > 0 \implies \limsup_{n\to\infty} \frac{F_2(n)}{n^\beta} > 0. \tag{101}$$

Moreover, we have

$$\liminf_{n\to\infty} \frac{G_1(n) - gn}{n^\beta} > 0 \implies \limsup_{n\to\infty} \frac{F_2(n)}{n^\beta} > 0. \tag{102}$$

The latter statement is less obvious so here comes a brief justification. The left-hand side implies that $G(n) = G_2(n) - gn - Bn^\beta \ge 0$ for all but finitely many $n$ and a certain $B > 0$. Then it suffices to apply Lemma 1.1 to obtain the right-hand side.

Inequality (100) is not so easy to refine, even given some additional assumptions about the functions $G_i(n)$ that we can easily assure in our applications. The lower limit $\liminf_n [F_2(n) - F_1(n)]$ cannot be bounded so easily. For example, let $G_1(n) = H(n)$ be the block entropy and $G_2(n) = H^C(n)$ be the expected code length (10). Then $G_1$ is nondecreasing and concave whereas it is reasonable to assume that $G_2$ is nondecreasing but only subadditive, i.e., $G_2(n + m) \le G_2(n) + G_2(m)$. Consequently, $F_1$ is nondecreasing and tends to $\lim_n [G_1(n) - ng]$ while $F_2$ can oscillate between 0 and $F_1$ in the worst case, as the following proposition assures.

Theorem 1.2: For any nonnegative, increasing, and concave function $G_1$ such that $\lim_n G_1(n) = \infty$, there exists a nondecreasing and subadditive function $G_2 \ge G_1$ such that $\lim_n G_2(n)/n = \lim_n G_1(n)/n$ and $\liminf_n F_2(n) = 0$.

Proof: We will construct a $G_2$ which is constant and linear on alternating intervals. Since $G_1$ is unbounded, there exists an infinite sequence of arguments $(b_i)_{i\in\mathbb{N}}$, where $b_1 := 1$ and $b_{i+1} := \min\{n \in \mathbb{N} : G_1(n) \ge 2G_1(b_i)\}$. In the next step, let us define

$$G_2(n) := \min\{n\,G_1(b_i)/b_i,\; G_1(b_{i+1})\}$$

for $b_i \le n \le b_{i+1}$.

This construction satisfies the required properties for the following reasons. Since $G_1$ is subadditive by concavity, we have $G_1(b_{i+1}) \ge 2G_1(b_i) \ge G_1(2b_i)$. Hence $b_{i+1} \ge 2b_i$ since $G_1$ is increasing. Moreover,

$$F_2(b_i) = 2G_1(b_i) - \min\{2G_1(b_i),\, G_1(b_{i+1})\} = 0.$$

Thus $\liminf_n F_2(n) = 0$. Inequality $G_2 \ge G_1$ holds since $G_1$ is growing and concave. $G_2$ is subadditive since $G_2(n)/n$ does not increase with $n$ [98, Theorem 7.2.4]. Finally, since $G_2$ and $G_1$ are both subadditive and are equal on the infinite sequence $(b_i)_{i\in\mathbb{N}}$, we have $\lim_n G_2(n)/n = \inf_n G_2(n)/n = \inf_n G_1(n)/n = \lim_n G_1(n)/n$ by the Fekete lemma. ■
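The construction in the proof can be carried out explicitly for a concrete concave function. The following is a minimal sketch, assuming $G_1(n) = \sqrt{n}$ as an illustrative choice and building $G_2$ only on a finite range; for this $G_1$ the breakpoints are the powers of 4.

```python
from math import sqrt

def build_G2(G1, limit):
    """Construct G2 as in the proof, valid for arguments up to the last breakpoint (> limit)."""
    b = [1]
    while b[-1] <= limit:
        n = b[-1] + 1
        while G1(n) < 2 * G1(b[-1]):     # b_{i+1} = min{n : G1(n) >= 2 G1(b_i)}
            n += 1
        b.append(n)

    def G2(n):
        for lo, hi in zip(b, b[1:]):
            if lo <= n <= hi:
                # linear with slope G1(b_i)/b_i, then constant at G1(b_{i+1})
                return min(n * G1(lo) / lo, G1(hi))
        raise ValueError("n outside the constructed range")

    return G2, b

G1 = sqrt
G2, b = build_G2(G1, 5000)

def F2(n):
    """Excess value F_2(n) = 2 G_2(n) - G_2(2n)."""
    return 2 * G2(n) - G2(2 * n)

print(b)
print([F2(bi) for bi in b[:-1]])
```

The excess $F_2$ vanishes at every breakpoint even though $F_1(n) = (2 - \sqrt{2})\sqrt{n}$ grows without bound, which illustrates why the lower limit in (100) cannot be controlled without further assumptions.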

Appendix II 
Bounds for the longest repeat 

Let us review several bounds for the maximal length of a repeat in a string, defined in (20). First of all, if the alphabet is a finite set $\mathbb{X} = \{0, 1, \ldots, D_X - 1\}$ then

$$L(w) \ge \log_{D_X} |w| - \log_{D_X} \log_{D_X} |w| - 1 \tag{103}$$

for any string $w \in \mathbb{X}^*$. This bound is justified by the observation that if $w$ can be split into at least $D_X^n + 1$ substrings of length $n$ then at least two substrings must be identical. The right-hand side of (103) corresponds to one of the possible values of $n$.
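Bound (103) can be tested by brute force on random strings. The following is a minimal sketch, assuming that $L(w)$ is the length of the longest substring that occurs at least twice in $w$, with overlapping occurrences allowed (definition (20) is not restated here, so this reading is an assumption), and restricting attention to strings of length at least 2.

```python
import random
from math import log

def longest_repeat(w):
    """Length of the longest substring occurring at least twice in w (overlaps allowed)."""
    k = 0
    while True:
        seen, found = set(), False
        for i in range(len(w) - k):          # all blocks of length k + 1
            block = w[i:i + k + 1]
            if block in seen:
                found = True
                break
            seen.add(block)
        if not found:
            return k                          # no repeat of length k + 1 exists
        k += 1

def repeat_lower_bound(length, D):
    """Right-hand side of (103) for a string of the given length over D letters."""
    log_D_len = log(length) / log(D)
    return log_D_len - log(log_D_len) / log(D) - 1

rng = random.Random(0)
w = "".join(rng.choice("01") for _ in range(1000))
print(longest_repeat(w), repeat_lower_bound(len(w), 2))
```

For a typical random binary string the longest repeat is roughly twice as long as the guaranteed minimum, since the pigeonhole argument behind (103) only counts non-overlapping blocks.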

Bounding the maximal repeat length above by a sublinear function is impossible with respect to certain classes of probability measures. For any function $g(n)$ with $\lim_n g(n)/n = 0$ there exists such a stationary process $(X_i)_{i\in\mathbb{Z}}$ that

$$\limsup_{n\to\infty} L(X_{1:n})/g(n) \ge 1 \quad \text{a.s.}$$

(i.e., almost surely) [99]. Nevertheless, a strong upper bound exists for quite a large class of processes:

Definition 2.1: $(X_i)_{i\in\mathbb{Z}}$ is called a finite-energy process if

$$P(X_{n+1:n+m} \mid X_{1:n}) \le K c^m \quad \text{a.s.} \tag{104}$$

for $n, m \in \mathbb{N}$ and certain constants $c < 1$ and $K$.

Lemma 2.2: Let $(X_i)_{i\in\mathbb{Z}}$ be a finite-energy process. We have

$$\sup_{n > 1} E\left(\frac{L(X_{1:n})}{\log n}\right)^{q} < \infty, \quad q > 0, \tag{105}$$

$$\limsup_{n\to\infty} \frac{L(X_{1:n})}{\log n} \le A \quad \text{a.s.} \tag{106}$$

for a constant $A < \infty$.

Remark: Lemma 2.2 is true for any countable alphabet $\mathbb{X}$, also if $(X_i)_{i\in\mathbb{Z}}$ is not stationary. Finite-energy processes can be obtained by dithering ergodic processes with an IID noise [36].

Proof: The almost sure part was shown by [36, Theorem 2]. It remains to demonstrate the bound in expectation. Assume (104) and consider $j > i \ge 0$. Applying the idea of [100], let us notice that

$$P(X_{j+1:j+k} = X_{i+1:i+k}) = \sum_{w} P(X_{j+1:j+k} = X_{i+1:i+k} \mid X_{1:j} = w)\,P(X_{1:j} = w) \le K c^k.$$

Hence

$$P(L(X_{1:n}) \ge k) = P(\exists_{0 \le i < j \le n-k}\; X_{j+1:j+k} = X_{i+1:i+k})$$
$$\le \sum_{0 \le i < j \le n-k} P(X_{j+1:j+k} = X_{i+1:i+k})$$
$$\le \frac{(n - k + 1)(n - k)}{2} K c^k \le \frac{n^2 K c^k}{2}.$$

This bound is nontrivial for $k > A := (2\log n + \log K - \log 2)/\log c^{-1}$. Consider a sufficiently large $n$ so that $A > 1$. Inequality (105) follows from the series of inequalities

$$E\,(L(X_{1:n}))^q \le A^q + \sum_{k > A} k^q P(L(X_{1:n}) \ge k) \le \sum_{k=0}^{\infty} (k + A)^q c^k \le A^q \sum_{k=0}^{\infty} (k + 1)^q c^k,$$

where $\sum_{k=0}^{\infty} (k + 1)^q c^k < \infty$.



Appendix III 
Excess lengths of universal codes 

Let $\mathbb{X} = \{0, 1, \ldots, D_X - 1\}$ and $\mathbb{Y} = \{0, 1, \ldots, D_Y - 1\}$ be the input and the output alphabets. Denote the expected length of code $C : \mathbb{X}^+ \to \mathbb{Y}^+$ as

$$H^C(n) := E\,|C(X_{1:n})| \log D_Y$$

and its excess $E^C(n) := 2H^C(n) - H^C(2n)$ as previously.

Universal codes are those uniquely decodable codes that achieve the best possible compression rate $\lim_n H^C(n)/n = h$ on the average, whereas inequalities $H^C(n) \ge H(n) \ge hn$ hold in general. By Lemma 1.1 we obtain that

$$\limsup_{n\to\infty}\,[E^C(n) - E^{C'}(n)] \ge 0 \quad \text{if } H^C(\cdot) \ge H^{C'}(\cdot) \tag{107}$$

for any universal codes $C$ and $C'$. Thus the search for the shortest codes reduces to the task of finding universal codes that enjoy the smallest excess code length.

By Lemma 1.1 the excess code length is bounded below by the $n$-symbol excess entropy,

$$\limsup_{n\to\infty}\,[E^C(n) - E(n)] \ge 0. \tag{108}$$


This bound is not so strong in view of the following fact:

Theorem 3.1: Let $\beta \in (0, 1)$. The following statements are equivalent:

(i) $H^C(n) - H(n) \le An^\beta$ holds for an $A > 0$ and all $n$.

(ii) $E^C(n) - E(n) \le Bn^\beta$ holds for a $B > 0$ and all $n$.

Proof: If (i) holds then (ii) holds for $B = 2A$ since $H^C(n) - H(n) \ge 0$. Conversely, if (ii) is true then

$$H^C(n) - H(n) = \sum_{k=0}^{\infty} \frac{E^C(2^k n) - E(2^k n)}{2^{k+1}} \le \sum_{k=0}^{\infty} B n^\beta 2^{k(\beta - 1) - 1} = \frac{B n^\beta}{2(1 - 2^{\beta - 1})}.$$

Fix some $\beta \in (0, 1)$. For each universal code there exists a stationary process for which statement (i) is not true [50]. Thus proposition (ii) is false in the same case, as well.

Using the ergodic decomposition of excess entropy, we can prove that $\sup_n E^C(n)$ is finite only for countably many ergodic processes. This fact has been mentioned without proof in [3, Theorem 6]. On the other hand, $E(n)$ is bounded by the finite excess entropy for uncountably many ergodic sources. For instance, $E < \infty$ for all irreducible Markov processes. Denote the expectation of the excess code length as

$$E^C_\mu(n) := E_\mu\left(2\,|C(X_{1:n})| - |C(X_{1:2n})|\right) \log D_Y,$$

taken with respect to a measure $\mu = P((X_k)_{k\in\mathbb{Z}} \in \cdot) \in \mathbb{S}$.

Theorem 3.2: For a finite alphabet $\mathbb{X}$ and a universal code $C$, let $N^C(K)$ be the number of distinct ergodic measures $\mu \in \mathbb{E}$ such that $\limsup_n E^C_\mu(n) \le K$, $K \in \mathbb{R}$. We have

$$\log N^C(K) \le K$$

for $K \ge 0$ whereas $N^C(K) = 0$ for $K < 0$.

Proof: The case of $K < 0$ is directly captured by inequality (108). As for $K \ge 0$, let us firstly inspect the ergodic decomposition of the expected excess code length.

Consider the distribution of the random ergodic measure $F$. Using $\nu(W) := P(F \in W)$, $W \in \mathcal{E}$, equation (50) may be written as

$$P((X_i)_{i\in\mathbb{Z}} \in A) = \int \sigma(A)\,d\nu(\sigma).$$

The code lengths $|C(w)|$ for $w \in \mathbb{X}^n$ can be upper-bounded by a natural number if $\mathbb{X}$ is finite. Hence

$$E^C_P(n) = E\,E^C_F(n) \tag{109}$$

follows by the disintegration formula

$$\int f\,d\left(\int \sigma\,d\nu(\sigma)\right) = \int \left(\int f\,d\sigma\right) d\nu(\sigma)$$

for a bounded $\mathcal{X}^{\mathbb{Z}}$-measurable function $f$ [64, Exercise 18.19]. By (11), (108), and (109) we obtain

$$\limsup_{n\to\infty} E\,E^C_F(n) = \limsup_{n\to\infty} E^C_P(n) \ge H(F) + E\,E_F. \tag{110}$$

Inequality (110) will be used in the further reasoning.

Consider a natural number $M$ such that $M \le N^C(K)$. Let $\mathcal{A} \subset \mathbb{E}$ be a subset of $M$ ergodic measures $\mu$ such that $\limsup_n E^C_\mu(n) \le K$. Let process $(X_i)_{i\in\mathbb{Z}}$ have distribution $P((X_i)_{i\in\mathbb{Z}} \in \cdot) = M^{-1} \sum_{\mu \in \mathcal{A}} \mu$. By the uniqueness of its ergodic decomposition, the random ergodic measure $F$ takes the value of each $\mu \in \mathcal{A}$ with probability $1/M$. Hence $H(F) = \log M$ by [3, Theorem 2(i) and Lemma 3].

Take some $\epsilon > 0$. Random variables $K + \epsilon - E^C_F(n)$, $n \in \mathbb{N}$, are almost surely nonnegative for all but finitely many $n$. Thus, by the Fatou lemma, $K + \epsilon - E \limsup_n E^C_F(n) \le K + \epsilon - \limsup_n E\,E^C_F(n)$. Hence from inequality (110) we obtain

$$\log M = H(F) \le \limsup_{n\to\infty} E\,E^C_F(n) \le E \limsup_{n\to\infty} E^C_F(n) \le K.$$

Since this holds for any $M \le N^C(K)$, the claim follows. ■

Appendix IV 
The vocabulary size of Y-K codes 

There is a large qualitative difference between the grammar-based codes which were introduced in [5] and those that minimize the length induced by the local encoder $B$ which satisfies condition (27). The empirical study of [15] compared the longest matching grammar transform (LMG) [5], [6], an irreducible transform which locally minimizes the Yang-Kieffer length $|\cdot|$, with a similar grammar transform called BLMG that locally minimizes the length $|B(\cdot)|$. It appeared that LMG and BLMG behave in a strikingly different way in terms of the vocabulary size.

The grammar transforms were applied to two novels in
English and in Polish (abt. 6 x 10^ characters) and their
unigram approximations (roughly, random permutations of the
texts). Paradoxically, the LMG discovered more structure in
the unigram text (abt. 6 x 10^ distinct nonterminals) than in 
the original data (abt. 3 x 10^ nonterminals). For the BLMG, 
a difference of two orders was observed but in the opposite 
direction (about 1 x 10^ nonterminals for the unigram data and 
1 X 10** for the original). 

As far as probabilistic modeling makes sense, the text in natural language has lower entropy rate and much higher excess entropy than its unigram approximation. Thus the vocabulary size of the BLMG seems proportional to the redundancy of the source $H(1) - h$, in accordance with Theorem 3.1. In contrast, the vocabulary size of the LMG appears proportional rather to the entropy rate $h$. Part (i) of the following proposition was mentioned by [15] as an explanation of this counterintuitive behavior of the longest matching transform:

Theorem 4.1: (i) If $\Gamma$ is an $\mathcal{I}$-grammar transform then

$$V[\Gamma(w)] \ge \sqrt{|\Gamma(w)|/2} - D_X - 1. \tag{111}$$

(ii) If $\Gamma$ is an $\mathcal{F} \cap \mathcal{P}$-grammar transform then

$$V[\Gamma(w)]\,L(w) \ge \sqrt{|\Gamma(w)|/2} - D_X - 1. \tag{112}$$

Remark: The notations for grammar classes are as in Subsection III-C except for $\mathcal{P}$, which stands for the set of partially irreducible grammars. Grammar $(\alpha_1, \alpha_2, \ldots, \alpha_n)$ is called partially irreducible if it satisfies conditions (i) and (ii) of irreducibility and, moreover, each pair of consecutive symbols in string $\alpha_1$ appears at most once at nonoverlapping positions. The LMG is an $\mathcal{I}$-grammar transform and there exists an $\mathcal{F} \cap \mathcal{P}$-grammar transform $\Gamma$ which is a modification of the LMG. In order to compute the value of this transform for a string $w$, we start with the grammar $\{A_1 \to w\}$ and iteratively replace the longest repeated substrings $u$ in the start symbol definition with the new nonterminals $A_i \to u$ until there is no repeat of length $|u| \ge 2$.

Proof: Write $G = \Gamma(w)$ and $V = V[\Gamma(w)]$ for brevity. Notice that $x + a + 1 \ge \sqrt{y/2}$ follows from $(y - x)/2 \le (x + a)^2$ for $x, y, a \ge 0$.

(i) In this case, any pair of symbols occurs at most once at every second position of all right-hand sides of $G$. Hence, $(|G| - V)/2 \le (V + D_X)^2$, which implies (111).

(ii) At every second position of the start symbol definition in $G$, a pair of symbols can occur only once. Thus (112) follows by $[|G| - V L(w)]/2 \le (V + D_X)^2 \le (V L(w) + D_X)^2$.

■
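The modified longest matching transform described in the Remark can be prototyped in a few lines. The following is a minimal sketch of one possible reading of that description, not the transform of [5], [6] itself: repeats are required to occur at nonoverlapping positions, occurrences are replaced greedily from left to right, and nonterminals are represented as tuples `("A", i)`; these representation choices are assumptions of the sketch.

```python
def longest_repeat_block(seq):
    """Longest block (length >= 2) occurring at least twice at nonoverlapping positions, or None."""
    n = len(seq)
    for length in range(n // 2, 1, -1):
        first_seen = {}
        for i in range(n - length + 1):
            block = tuple(seq[i:i + length])
            j = first_seen.setdefault(block, i)   # leftmost occurrence of this block
            if i - j >= length:
                return block
    return None

def replace_all(seq, block, symbol):
    """Replace nonoverlapping occurrences of block by symbol, scanning left to right."""
    out, i, k = [], 0, len(block)
    while i < len(seq):
        if tuple(seq[i:i + k]) == block:
            out.append(symbol)
            i += k
        else:
            out.append(seq[i])
            i += 1
    return out

def modified_lmg(w):
    """Start from {A_1 -> w} and factor out longest repeats of the start symbol definition."""
    start, rules = list(w), {}
    while True:
        block = longest_repeat_block(start)
        if block is None:
            return start, rules
        name = ("A", len(rules) + 2)              # A_1 is reserved for the start symbol
        rules[name] = block
        start = replace_all(start, block, name)

def expand(seq, rules):
    """Expand all nonterminals back into terminal symbols."""
    out = []
    for s in seq:
        out.extend(expand(rules[s], rules) if s in rules else [s])
    return out

w = "abracadabra abracadabra"
start, rules = modified_lmg(w)
print(start, len(rules))
```

The number of entries in `rules` plays the role of the vocabulary size $V[\Gamma(w)]$ in this sketch. Because only the start symbol definition is rewritten, the secondary rules may still contain repeats (here, "abra" inside "abracadabra").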

Consider a stationary ergodic process $(X_i)_{i\in\mathbb{Z}}$ with an entropy rate $h$. For any $\mathcal{I}$-transform $\Gamma$ the respective Yang-Kieffer code $C = B_{\mathrm{YK}}(\Gamma(\cdot))$ is universal so

$$|\Gamma(X_{1:n})|\,(\mathrm{const} + \log n) \ge hn - 2\log n$$

holds for all but finitely many $n$ almost surely. This follows by the code construction, Barron's inequality, and the asymptotic equipartition. Hence Theorem 4.1 implies

$$V[\Gamma(X_{1:n})] \ge \sqrt{\frac{hn - 2\log n}{2(\mathrm{const} + \log n)}} - D_X - 1$$

for all but finitely many $n$ almost surely, i.e., the vocabulary size grows at least as $\sqrt{n/\log n}$.

This reasoning cannot be transferred to the case of a $(B, \mathcal{J})$-minimal universal code where $B$ is a local encoder. In the grammars produced by this code, the substrings may appear more than twice and there is no fixed upper bound on the number of repeats.